Re: First draft of PG 17 release notes

2024-05-11 Thread Joe Conway

On 5/11/24 09:57, Jelte Fennema-Nio wrote:

On Fri, 10 May 2024 at 23:31, Tom Lane  wrote:


Bruce Momjian  writes:
> I looked at both of these.   In both cases I didn't see why the user
> would need to know these changes were made:

I agree that the buffering change is not likely interesting, but
the fact that you can now control-C out of a psql "\c" command
is user-visible.  People might have internalized the fact that
it didn't work, or created complicated workarounds.


The buffering change improved performance by up to ~40% in some of the
benchmarks. The cases it improves most are COPY of large rows and
streaming a base backup. That sounds user-visible enough to me to
warrant an entry IMHO.


+1

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: 'trusted'/'untrusted' PL in DoD/DISA PostgreSQL STIGs

2024-05-05 Thread Joe Conway

On 5/5/24 13:53, Chapman Flack wrote:

The four STIGs suggest the same email address [5] for comments or
proposed revisions. I could send these comments there myself, but
I thought it likely that others in the community have already been
involved in the development of those documents and might have better
connections.


Those docs were developed by the respective companies (Crunchy and EDB) 
in cooperation with DISA. The community has nothing to do with them. I 
suggest you contact the two companies with corrections and suggestions.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-05-03 Thread Joe Conway

On 5/3/24 11:44, Peter Eisentraut wrote:

On 03.05.24 16:13, Tom Lane wrote:

Peter Eisentraut  writes:

On 30.04.24 19:29, Tom Lane wrote:

Also, the bigger picture here is the seeming assumption that "if
we change pg_trgm then it will be safe to replicate from x86 to
arm".  I don't believe that that's a good idea and I'm unwilling
to promise that it will work, regardless of what we do about
char signedness.  That being the case, I don't want to invest a
lot of effort in the signedness issue.  Option (1) is clearly
a small change with little if any risk of future breakage.



But note that option 1 would prevent some replication that is currently
working.


The point of this thread though is that it's working only for small
values of "work".  People are rightfully unhappy if it seems to work
and then later they get bitten by compatibility problems.

Treating char signedness as a machine property in pg_control would
signal that we don't intend to make it work, and would ensure that
even the most minimal testing would find out that it doesn't work.

If we do not do that, it seems to me we have to buy into making
it work.  That would mean dealing with the consequences of an
incompatible change in pg_trgm indexes, and then going through
the same dance again the next time(s) similar problems are found.


Yes, that is understood.  But anecdotally, replicating between x86-64 and arm64 is 
occasionally used for upgrades or migrations.  In practice, this appears to have 
mostly worked.  If we now discover that it won't work with certain index 
extension modules, it's still usable for most users. Even if we say, you have to 
reindex everything afterwards, it's probably still useful for these scenarios.


+1

I have heard similar anecdotes, and the reported experience goes even further -- 
many such upgrade/migration uses, with exceedingly rare reported failures.
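
To make the underlying issue concrete, here is a toy illustration (mine, not
from the thread's patches): plain "char" is signed on x86 and unsigned on arm,
so byte values above 0x7F can compare differently on the two architectures:

8<-------------------------------
#include <stdio.h>

int
main(void)
{
	char		hi = (char) 0xE9;	/* e.g. a Latin-1 e-acute byte */
	char		lo = 'a';			/* 0x61 */

	/* prints 1 where plain char is signed (x86), 0 where unsigned (arm) */
	printf("hi < lo: %d\n", hi < lo);
	return 0;
}
8<-------------------------------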


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: New GUC autovacuum_max_threshold ?

2024-04-26 Thread Joe Conway

On 4/26/24 09:31, Robert Haas wrote:

On Fri, Apr 26, 2024 at 9:22 AM Joe Conway  wrote:

Although I don't think 500000 is necessarily too small. In my view,
having autovac run very quickly, even if more frequently, provides an
overall better user experience.


Can you elaborate on why you think that? I mean, to me, that's almost
equivalent to removing autovacuum_vacuum_scale_factor entirely,
because only for very small tables will that calculation produce a
value lower than 500k.


If I understood Nathan's proposed calc, for small tables you would still 
get (thresh + sf * numtuples). Once that number exceeds the new limit 
parameter, then the latter would kick in. So small tables would retain 
the current behavior and large enough tables would be clamped.
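
To illustrate what I mean, a toy sketch of that clamp (mine, not the actual
patch -- the parameters are just stand-ins for the GUCs being discussed):

8<-------------------------------
#include <stdio.h>

static double
vacuum_threshold(double base_thresh, double scale_factor,
				 double reltuples, double max_thresh)
{
	double		vacthresh = base_thresh + scale_factor * reltuples;

	/* small tables keep the current behavior; large tables get clamped */
	if (vacthresh > max_thresh)
		vacthresh = max_thresh;

	return vacthresh;
}

int
main(void)
{
	/* defaults: threshold 50, scale factor 0.2, proposed cap 500000 */
	printf("%.0f\n", vacuum_threshold(50, 0.2, 10000, 500000));		/* 2050 */
	printf("%.0f\n", vacuum_threshold(50, 0.2, 100000000, 500000));	/* 500000 */
	return 0;
}
8<-------------------------------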



We might need to try to figure out some test cases here. My intuition
is that this is going to vacuum large tables insanely aggressively.


It depends on workload to be sure. Just because a table is large, it 
doesn't mean that dead rows are generated that fast.


Admittedly it has been quite a while since I looked at all this that 
closely, but if A/V runs on some large busy table for a few milliseconds 
once every few minutes, that is far less disruptive than A/V running for 
tens of seconds once every few hours or for minutes once every few days 
-- or whatever. The key thing to me is the "few milliseconds" runtime. 
The short duration means that no one notices an impact, and the longer 
duration almost guarantees that an impact will be felt.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: New GUC autovacuum_max_threshold ?

2024-04-26 Thread Joe Conway

On 4/26/24 04:43, Michael Banck wrote:

So this proposal (probably along with a higher default threshold than
500000, but IMO less than what Robert and Nathan suggested) sounds like
a step forward to me. DBAs can set the threshold lower if they want, or
maybe we can just turn it off by default if we cannot agree on a sane
default, but I think this (using the simplified formula from Nathan) is
a good approach that takes some pain away from autovacuum tuning and
reserves that for the really difficult cases.


+1 to the above

Although I don't think 500000 is necessarily too small. In my view, 
having autovac run very quickly, even if more frequently, provides an 
overall better user experience.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Table AM Interface Enhancements

2024-04-10 Thread Joe Conway

On 4/10/24 09:19, Robert Haas wrote:

When you commit a patch and another committer writes a post-commit
review saying that your patch has so many serious problems that he
gave up on reviewing before enumerating all of them, that's a really
bad sign. That should be a strong signal to you to step back and take
a close look at whether you really understand the area of the code
that you're touching well enough to be doing whatever it is that
you're doing. If I got a review like that, I would have reverted the
patch instantly, given up for the release cycle, possibly given up on
the patch permanently, and most definitely not tried again to commit
unless I was absolutely certain that I'd learned a lot in the meantime
*and* had the agreement of the committer who wrote that review (or
maybe some other committer who was acknowledged as an expert in that
area of the code).





It's not Andres's job to make sure my patches are not broken. It's my
job. That applies to the patches I write, and the patches written by
other people that I commit. If I commit something and it turns out
that it is broken, that's my bad. If I commit something and it turns
out that it does not have consensus, that is also my bad. It is not
the fault of the other people for not helping me get my patches to a
state where they are up to project standard. It is my fault, and my
fault alone, for committing something that was not ready. Now that
does not mean that it isn't frustrating when I can't get the help I
need. It is extremely frustrating. But the solution is not to commit
anyway and then blame the other people for not providing feedback.


+many

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Security lessons from liblzma

2024-04-09 Thread Joe Conway

On 4/8/24 22:57, Bruce Momjian wrote:

On Sat, Mar 30, 2024 at 04:50:26PM -0400, Robert Haas wrote:

An awful lot of what we do operates on the principle that we know the
people who are involved and trust them, and I'm glad we do trust them,
but the world is full of people who trusted somebody too much and
regretted it afterwards. The fact that we have many committers rather
than a single maintainer probably reduces risk at least as far as the
source repository is concerned, because there are more people paying
attention to potentially notice something that isn't as it should be.


One unwritten requirement for committers is that we are able to
communicate with them securely.  If we cannot do that, they potentially
could be forced by others, e.g., governments, to add code to our
repositories.

Unfortunately, there is no good way for them to communicate with us
securely once they are unable to communicate with us securely.  I
suppose some special word could be used to communicate that status ---
that is how it was done in non-electronic communication in the past.


I don't know how that really helps. If one of our committers is under 
duress, they probably cannot risk outing themselves anyway.


The best defense, IMHO, is the fact that our source code is open and can 
be reviewed freely.


The trick is to get folks to do the review.

I know, for example, at $past_employer we had a requirement to get 
someone on our staff to review every single commit in order to maintain 
certain certifications. Of course there is no guarantee that such 
reviews would catch everything, but maybe we could establish post commit 
reviews by contributors in a more rigorous way? Granted, getting more 
qualified volunteers is not a trivial problem...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-08 Thread Joe Conway

On 4/8/24 11:05, Tom Lane wrote:

Pavel Borisov  writes:

IMO the fact that people struggle to work on patches, and make them better,
etc. is an immense blessing for the Postgres community. Is the peak of
commits really a big problem provided we have 6 months before actual
release? I doubt March patches tend to be worse than the November ones.


Yes, it's a problem, and yes the average quality of last-minute
patches is visibly worse than that of patches committed in a less
hasty fashion.  We have been through this every year for the last
couple decades, seems like, and we keep re-learning that lesson
the hard way.  I'm just distressed at our utter failure to learn
from experience.



I don't dispute that we could do better, and this is just a simplistic 
look based on "number of commits per day", but the attached does put it 
in perspective to some extent.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


Re: Security lessons from liblzma

2024-03-31 Thread Joe Conway

On 3/31/24 11:49, Tom Lane wrote:

Joe Conway  writes:
I am saying maybe those patches should be eliminated in favor of our 
tree including build options that would produce the same result.


I don't really see how that can be expected to work sanely.
It turns the responsibility for platform-specific build issues
on its head, and it doesn't work at all for issues discovered
after we make a release.  The normal understanding of how you
can vet a distro's package is that you look at the package
contents (the SRPM in Red Hat world and whatever the equivalent
concept is elsewhere), check that the contained tarball
matches upstream and that the patches and build instructions
look sane, and then build it locally and check for a match to
the distro's binary package.  Even if we could overcome the
obstacles to putting the patch files into the upstream tarball,
we're surely not going to include the build instructions, so
we'd not have moved the needle very far in terms of whether the
packager could do something malicious.


True enough I guess.

But it has always bothered me how many patches get applied to the 
upstream tarballs by the OS package builders. Some of them, e.g. glibc 
on RHEL 7, include more than 1000 patches that you would have to 
manually vet if you cared enough and had the skills. Last time I looked 
at the openssl package sources it was similar in volume and complexity. 
They might as well be called forks if everyone were being honest about it...


I know our PGDG packages are no big deal compared to that, fortunately.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Security lessons from liblzma

2024-03-31 Thread Joe Conway

On 3/30/24 21:52, Bruce Momjian wrote:

On Sat, Mar 30, 2024 at 07:54:00PM -0400, Joe Conway wrote:

Virtually every RPM source, including ours, contains out of tree patches
that get applied on top of the release tarball. At least for the PGDG
packages, it would be nice to integrate them into our git repo as build
options or whatever so that the packages could be built without any patches
applied to it. Add a tarball that is signed and traceable back to the git
tag, and we would be in a much better place than we are now.


How would someone access the out-of-tree patches?  I think Debian
includes the patches in its source tarball.


I am saying maybe those patches should be eliminated in favor of our 
tree including build options that would produce the same result.


For example, these patches are applied to our release tarball files when 
the RPM is being built for pg16 on RHEL 9:


-
https://git.postgresql.org/gitweb/?p=pgrpms.git;a=blob;f=rpm/redhat/main/non-common/postgresql-16/main/postgresql-16-rpm-pgsql.patch;h=d9b6d12b7517407ac81352fa325ec91b05587641;hb=HEAD

https://git.postgresql.org/gitweb/?p=pgrpms.git;a=blob;f=rpm/redhat/main/non-common/postgresql-16/main/postgresql-16-var-run-socket.patch;h=f2528efaf8f4681754b20283463eff3e14eedd39;hb=HEAD

https://git.postgresql.org/gitweb/?p=pgrpms.git;a=blob;f=rpm/redhat/main/non-common/postgresql-16/main/postgresql-16-conf.patch;h=da28ed793232316dd81fdcbbe59a6479b054a364;hb=HEAD

https://git.postgresql.org/gitweb/?p=pgrpms.git;a=blob;f=rpm/redhat/main/non-common/postgresql-16/main/postgresql-16-perl-rpath.patch;h=748c42f0ec2c9730af3143e90e5b205c136f40d9;hb=HEAD
-

Nothing too crazy, but wouldn't it be better if no patches were required 
at all?


Ideally we should have reproducible builds so that starting with our 
tarball (which is traceable back to the git release tag) one can easily 
obtain the same binary as what the RPMs/DEBs deliver.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Security lessons from liblzma

2024-03-30 Thread Joe Conway

On 3/30/24 19:54, Joe Conway wrote:

On 2024-03-30 16:50:26 -0400, Robert Haas wrote:

or what Tom does when he builds the release tarballs.


Tom follows this, at least last time I checked:

https://wiki.postgresql.org/wiki/Release_process


Reading through that, I wonder if this part is true anymore:

  In principle this could be done anywhere, but again there's a concern
  about reproducibility, since the results may vary depending on
  installed bison, flex, docbook, etc versions. Current practice is to
  always do this as pgsql on borka.postgresql.org, so it can only be
  done by people who have a login there. In detail:

Maybe if we split out the docs from the release tarball, we could also 
add the script (mk-release) to our git repo?


Some other aspects of that wiki page look out of date too. Perhaps it 
needs an overall update? Maybe Tom and/or Magnus could weigh in here.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Security lessons from liblzma

2024-03-30 Thread Joe Conway

On 3/30/24 17:12, Andres Freund wrote:

Hi,

On 2024-03-30 16:50:26 -0400, Robert Haas wrote:

We might also want to move toward signing commits and tags. One of the
meson maintainers was recommending that on-list not long ago.


I don't know how valuable the commit signing really is, but I strongly agree
that we should sign both tags and tarballs.


+1



We should think about weaknesses that might occur during the packaging
process, too. If someone who alleges that they're packaging PG is really
packaging PG w/badstuff123.patch, how would we catch that?


I don't think we realistically can. The environment, configure and compiler
options will influence things too much to do any sort of automatic
verification.

But that shouldn't stop us from ensuring that at least the packages
distributed via *.postgresql.org are reproducibly built.

Another good avenue for introducing an attack would be to propose some distro
specific changes to the packaging teams - there's a lot fewer eyes there.  I
think it might be worth working with some of our packagers to integrate more
of their changes into our tree.


Huge +1 to that. I have thought many times, and even said to Devrim, that a 
huge number of people trust him not to be evil.


Virtually every RPM source, including ours, contains out of tree patches 
that get applied on top of the release tarball. At least for the PGDG 
packages, it would be nice to integrate them into our git repo as build 
options or whatever so that the packages could be built without any 
patches applied to it. Add a tarball that is signed and traceable back 
to the git tag, and we would be in a much better place than we are now.



I can't for example verify what the infrastructure team is doing,


Not sure what you feel like you should be able to follow -- anything 
specific?



or what Tom does when he builds the release tarballs.


Tom follows this, at least last time I checked:

https://wiki.postgresql.org/wiki/Release_process


This one however, I think we could improve upon. Making sure the tarball
generation is completely binary reproducible and providing means of checking
that would surely help. This will be a lot easier if we, as discussed
elsewhere, I believe, split out the generated docs into a separately
downloadable archive. We already stopped including other generated files
recently.


again, big +1


We shouldn't make the mistake of assuming that bad things can't happen to
us.


+1


and again with the +1 ;-)

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Popcount optimization using AVX512

2024-03-25 Thread Joe Conway

On 3/25/24 11:12, Tom Lane wrote:

"Amonson, Paul D"  writes:

I am re-posting the patches as CI for Mac failed (CI error not code/test 
error). The patches are the same as last time.


Just for a note --- the cfbot will re-test existing patches every
so often without needing a bump.  The current cycle period seems to
be about two days.



Just an FYI -- there seems to be an issue with all three of the macos 
cfbot runners (mine included). I spent time over the weekend working 
with Thomas Munro (added to CC list) trying different fixes to no avail. 
Help from macos CI wizards would be gratefully accepted...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Possibility to disable `ALTER SYSTEM`

2024-03-19 Thread Joe Conway

On 3/19/24 07:49, Andrew Dunstan wrote:



On Tue, Mar 19, 2024 at 5:26 AM Heikki Linnakangas  wrote:


I want to remind everyone of this from Gabriele's first message that
started this thread:

 > At the moment, a possible workaround is that `ALTER SYSTEM` can be blocked
 > by making the postgresql.auto.conf read only, but the returned message is
 > misleading and that’s certainly bad user experience (which is very
 > important in a cloud native environment):
 >
 >
 > ```
 > postgres=# ALTER SYSTEM SET wal_level TO minimal;
 > ERROR:  could not open file "postgresql.auto.conf": Permission denied
 > ```

I think making the config file read-only is a fine solution. If you
don't want postgres to mess with the config files, forbid it with the
permission system.

Problems with pg_rewind, pg_basebackup were mentioned with that
approach. I think if you want the config files to be managed outside
PostgreSQL, by kubernetes, patroni or whatever, it would be good for
them to be read-only to the postgres user anyway, even if we had a
mechanism to disable ALTER SYSTEM. So it would be good to fix the
problems with those tools anyway.

The error message is not great, I agree with that. Can we improve it?
Maybe just add a HINT like this:

postgres=# ALTER SYSTEM SET wal_level TO minimal;
ERROR:  could not open file "postgresql.auto.conf" for writing:
Permission denied
HINT:  Configuration might be managed outside PostgreSQL


Perhaps we could make that even better with a GUC though. I propose a
GUC called 'configuration_managed_externally = true / false". If you
set
it to true, we prevent ALTER SYSTEM and make the error message more
definitive:

postgres=# ALTER SYSTEM SET wal_level TO minimal;
ERROR:  configuration is managed externally

As a bonus, if that GUC is set, we could even check at server startup
that all the configuration files are not writable by the postgres user,
and print a warning or refuse to start up if they are.

(Another way to read this proposal is to rename the GUC that's been
discussed in this thread to 'configuration_managed_externally'. That
makes it look less like a security feature, and describes the intended
use case.)




I agree with pretty much all of this.



+1 me too.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Improving contrib/tablefunc's error reporting

2024-03-09 Thread Joe Conway

On 3/9/24 15:39, Tom Lane wrote:

Joe Conway  writes:

On 3/9/24 13:07, Tom Lane wrote:

BTW, while I didn't touch it here, it seems fairly bogus that
connectby() checks both type OID and typmod for its output
columns while crosstab() only checks type OID.  I think
crosstab() is in the wrong and needs to be checking typmod.
That might be fit material for a separate patch though.


Is something like the attached what you had in mind? (applies on top of 
your two patches)


Yeah, exactly.

As far as the comment change goes:

-   * - attribute [1] of the sql tuple is the category; no need to check it -
-   * attribute [2] of the sql tuple should match attributes [1] to [natts]
+   * attribute [1] of the sql tuple is the category; no need to check it
+   * attribute [2] of the sql tuple should match attributes [1] to [natts - 1]
 * of the return tuple

I suspect that this block looked better when originally committed but
didn't survive contact with pgindent.  You should check whether your
version will; if not, some dashes on the /* line will help.


Thanks for the review and heads up. I fiddled with it a bit to make it 
pgindent clean. I saw you commit your patches so I just pushed mine.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Improving contrib/tablefunc's error reporting

2024-03-09 Thread Joe Conway

On 3/9/24 13:07, Tom Lane wrote:

Joe Conway  writes:

On 3/5/24 17:04, Tom Lane wrote:

After reading the thread at [1], I could not escape the feeling
that contrib/tablefunc's error reporting is very confusing.



The changes all look good to me and indeed more consistent with the docs.
Do you want me to push these? If so, development tip  only, or backpatch?


I can push that.  I was just thinking HEAD, we aren't big on changing
error reporting in back branches.


BTW, while I didn't touch it here, it seems fairly bogus that
connectby() checks both type OID and typmod for its output
columns while crosstab() only checks type OID.  I think
crosstab() is in the wrong and needs to be checking typmod.
That might be fit material for a separate patch though.



I can take a look at this. Presumably this would not be for backpatching.


I'm not sure whether that could produce results bad enough to be
called a bug or not.  But I too would lean towards not back-patching,
in view of the lack of field complaints.



Is something like the attached what you had in mind? (applies on top of 
your two patches)


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Make tablefunc crosstab() check typmod too

tablefunc connectby() checks both type OID and typmod for its output
columns while crosstab() only checks type OID. Fix that by makeing
the crosstab() check look more like the connectby() check.

--- tablefunc.c.v0002	2024-03-09 14:38:29.285393890 -0500
+++ tablefunc.c	2024-03-09 14:43:47.021399855 -0500
@@ -1527,10 +1527,10 @@
 compatCrosstabTupleDescs(TupleDesc ret_tupdesc, TupleDesc sql_tupdesc)
 {
 	int			i;
-	Form_pg_attribute ret_attr;
 	Oid			ret_atttypid;
-	Form_pg_attribute sql_attr;
 	Oid			sql_atttypid;
+	int32		ret_atttypmod;
+	int32		sql_atttypmod;
 
 	if (ret_tupdesc->natts < 2)
 		ereport(ERROR,
@@ -1539,34 +1539,39 @@
  errdetail("Return row must have at least two columns.")));
 	Assert(sql_tupdesc->natts == 3);	/* already checked by caller */
 
-	/* check the rowid types match */
+	/* check the row_name types match */
 	ret_atttypid = TupleDescAttr(ret_tupdesc, 0)->atttypid;
 	sql_atttypid = TupleDescAttr(sql_tupdesc, 0)->atttypid;
-	if (ret_atttypid != sql_atttypid)
+	ret_atttypmod = TupleDescAttr(ret_tupdesc, 0)->atttypmod;
+	sql_atttypmod = TupleDescAttr(sql_tupdesc, 0)->atttypmod;
+	if (ret_atttypid != sql_atttypid ||
+		(ret_atttypmod >= 0 && ret_atttypmod != sql_atttypmod))
 		ereport(ERROR,
 (errcode(ERRCODE_DATATYPE_MISMATCH),
  errmsg("invalid crosstab return type"),
  errdetail("Source row_name datatype %s does not match return row_name datatype %s.",
-		   format_type_be(sql_atttypid),
-		   format_type_be(ret_atttypid;
+		   format_type_with_typemod(sql_atttypid, sql_atttypmod),
+		   format_type_with_typemod(ret_atttypid, ret_atttypmod;
 
 	/*
-	 * - attribute [1] of the sql tuple is the category; no need to check it -
-	 * attribute [2] of the sql tuple should match attributes [1] to [natts]
+	 * attribute [1] of the sql tuple is the category; no need to check it
+	 * attribute [2] of the sql tuple should match attributes [1] to [natts - 1]
 	 * of the return tuple
 	 */
-	sql_attr = TupleDescAttr(sql_tupdesc, 2);
+	sql_atttypid = TupleDescAttr(sql_tupdesc, 2)->atttypid;
+	sql_atttypmod = TupleDescAttr(sql_tupdesc, 2)->atttypmod;
 	for (i = 1; i < ret_tupdesc->natts; i++)
 	{
-		ret_attr = TupleDescAttr(ret_tupdesc, i);
-
-		if (ret_attr->atttypid != sql_attr->atttypid)
+		ret_atttypid = TupleDescAttr(ret_tupdesc, i)->atttypid;
+		ret_atttypmod = TupleDescAttr(ret_tupdesc, i)->atttypmod;
+		if (ret_atttypid != sql_atttypid ||
+		(ret_atttypmod >= 0 && ret_atttypmod != sql_atttypmod))
 			ereport(ERROR,
 	(errcode(ERRCODE_DATATYPE_MISMATCH),
 	 errmsg("invalid crosstab return type"),
 	 errdetail("Source value datatype %s does not match return value datatype %s in column %d.",
-			   format_type_be(sql_attr->atttypid),
-			   format_type_be(ret_attr->atttypid),
+			   format_type_with_typemod(sql_atttypid, sql_atttypmod),
+			   format_type_with_typemod(ret_atttypid, ret_atttypmod),
 			   i + 1)));
 	}
 


Re: Improving contrib/tablefunc's error reporting

2024-03-09 Thread Joe Conway

On 3/5/24 17:04, Tom Lane wrote:

After reading the thread at [1], I could not escape the feeling
that contrib/tablefunc's error reporting is very confusing.
Looking into the source code, I soon found that it is also
very inconsistent, with similar error reports being phrased
quite differently.  The terminology for column names doesn't
match the SGML docs either.  And there are some places that are
so confused about whether they are complaining about the calling
query or the called query that the output is flat-out backwards.
So at that point my nascent OCD wouldn't let me rest without
fixing it.  Here's a quick patch series to do that.

For review purposes, I split this into two patches.  0001 simply
adds some more test cases to reach currently-unexercised error
reports.  Then 0002 makes my proposed code changes and shows
how the existing error messages change.

I'm not necessarily wedded to the phrasings I used here,
in case anyone has better ideas.


The changes all look good to me and indeed more consistent with the docs.

Do you want me to push these? If so, development tip  only, or backpatch?


BTW, while I didn't touch it here, it seems fairly bogus that
connectby() checks both type OID and typmod for its output
columns while crosstab() only checks type OID.  I think
crosstab() is in the wrong and needs to be checking typmod.
That might be fit material for a separate patch though.


I can take a look at this. Presumably this would not be for backpatching.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2024-03-08 Thread Joe Conway

On 3/8/24 12:28, Andrey M. Borodin wrote:

Hello everyone!

Thanks for working on this, really nice feature!


On 9 Jan 2024, at 01:40, Joe Conway  wrote:

Thanks -- will have a look


Joe, recently folks have proposed a lot of patches in this thread that seem to 
have diverged from the original implementation approach.
As the author of CF entry [0], can you please comment on which patch version 
needs review?



I don't know if I agree with the proposed changes, but I have also been 
waiting to see how the parallel discussion regarding COPY extensibility 
shakes out.


And there were a couple of issues found that need to be tracked down.

Additionally I have had time/availability challenges recently.

Overall, chances seem slim that this will make it into 17, but I have 
not quite given up hope yet either.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Improving contrib/tablefunc's error reporting

2024-03-05 Thread Joe Conway

On 3/5/24 17:04, Tom Lane wrote:

After reading the thread at [1], I could not escape the feeling
that contrib/tablefunc's error reporting is very confusing.
Looking into the source code, I soon found that it is also
very inconsistent, with similar error reports being phrased
quite differently.  The terminology for column names doesn't
match the SGML docs either.  And there are some places that are
so confused about whether they are complaining about the calling
query or the called query that the output is flat-out backwards.
So at that point my nascent OCD wouldn't let me rest without
fixing it.  Here's a quick patch series to do that.

For review purposes, I split this into two patches.  0001 simply
adds some more test cases to reach currently-unexercised error
reports.  Then 0002 makes my proposed code changes and shows
how the existing error messages change.

I'm not necessarily wedded to the phrasings I used here,
in case anyone has better ideas.

BTW, while I didn't touch it here, it seems fairly bogus that
connectby() checks both type OID and typmod for its output
columns while crosstab() only checks type OID.  I think
crosstab() is in the wrong and needs to be checking typmod.
That might be fit material for a separate patch though.


Been a long time since I gave contrib/tablefunc any love I guess ;-)

I will have a look at your patches, and the other issue you mention, but 
it might be a day or three before I can give it some quality time.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





PostgreSQL Contributors Updates

2024-03-03 Thread Joe Conway

All,

The PostgreSQL Contributor Page 
(https://www.postgresql.org/community/contributors/) includes people who 
have made substantial, long-term contributions of time and effort to the 
PostgreSQL project. The PostgreSQL Contributors Team recognizes the 
following people for their contributions.


New PostgreSQL Contributors:

* Bertrand Drouvot
* Gabriele Bartolini
* Richard Guo

New PostgreSQL Major Contributors:

* Alexander Lakhin
* Daniel Gustafsson
* Dean Rasheed
* John Naylor
* Melanie Plageman
* Nathan Bossart

Thank you and congratulations to all!

Thanks,
On behalf of the PostgreSQL Contributors Team

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: An improved README experience for PostgreSQL

2024-02-28 Thread Joe Conway

On 2/28/24 14:36, Tom Lane wrote:

Nathan Bossart  writes:

On Wed, Feb 28, 2024 at 01:08:03PM -0600, Nathan Bossart wrote:

-PostgreSQL Database Management System
-=
+# PostgreSQL Database Management System



This change can be omitted, which makes the conversion even simpler.


That's a pretty convincing proof-of-concept.  Let's just do this,
and then make sure to keep the file legible as plain text.


+1

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: An improved README experience for PostgreSQL

2024-02-28 Thread Joe Conway

On 2/28/24 12:25, Daniel Gustafsson wrote:

On 28 Feb 2024, at 18:02, Nathan Bossart  wrote:

On Wed, Feb 28, 2024 at 02:46:11PM +0100, Daniel Gustafsson wrote:

On 26 Feb 2024, at 21:30, Tom Lane  wrote:
Nathan Bossart  writes:

I think this would be nice.  If the Markdown version is reasonably readable
as plain-text, maybe we could avoid maintaining two README files, too.
But overall, +1 to modernizing the README a bit.


Per past track record, we change the top-level README only once every
three years or so, so I doubt it'd be too painful to maintain two
versions of it.


It won't be, and we kind of already have two since there is another similar
README displayed at https://www.postgresql.org/ftp/.  That being said, a
majority of those reading the README will likely be new developers accustomed
to Markdown (or doing so via interfaces such as Github) so going to Markdown
might not be a bad idea.  We can also render a plain text version with pandoc
for release builds should we want to.


Sorry, my suggestion wasn't meant to imply that I have any strong concerns
about maintaining two README files.  If we can automate generating one or
the other, that'd be great, but I don't see that as a prerequisite to
adding a Markdown version.


Agreed, and I didn't say we should do it but rather that we can do it based on
the toolchain we already have.  Personally I think just having a Markdown
version is enough, it's become the de facto standard for such documentation for
good reasons.


+1

Markdown is pretty readable as text, I'm not sure why we need both.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: PGC_SIGHUP shared_buffers?

2024-02-19 Thread Joe Conway

On 2/19/24 13:13, Andres Freund wrote:

On 2024-02-19 09:19:16 -0500, Joe Conway wrote:

Couldn't we scale the rounding, e.g. allow small allocations as we do now,
but above some number always round? E.g. maybe >= 2GB round to the nearest
256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the nearest 1GB,
etc?


That'd make the translation considerably more expensive. Which is important,
given how common an operation this is.



Perhaps it is not practical, doesn't help, or maybe I misunderstand, but 
my intent was that the rounding be done/enforced when setting the GUC 
value which surely cannot be that often.
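
Something along these lines is what I was picturing (purely illustrative, not
PostgreSQL code -- the granule breakpoints are just the ones from the quoted
example above):

8<-------------------------------
#include <stdio.h>

static int
round_shared_buffers_mb(int mb)
{
	int			granule;

	if (mb >= 8192)				/* >= 8GB: 1GB granule */
		granule = 1024;
	else if (mb >= 4096)		/* >= 4GB: 512MB granule */
		granule = 512;
	else if (mb >= 2048)		/* >= 2GB: 256MB granule */
		granule = 256;
	else
		return mb;				/* small settings left untouched */

	return (mb + granule / 2) / granule * granule;	/* round to nearest */
}

int
main(void)
{
	printf("%d\n", round_shared_buffers_mb(1500));	/* 1500 (unchanged) */
	printf("%d\n", round_shared_buffers_mb(3000));	/* 3072 */
	printf("%d\n", round_shared_buffers_mb(9000));	/* 9216 */
	return 0;
}
8<-------------------------------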


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: PGC_SIGHUP shared_buffers?

2024-02-19 Thread Joe Conway

On 2/18/24 15:35, Andres Freund wrote:

On 2024-02-18 17:06:09 +0530, Robert Haas wrote:

How many people set shared_buffers to something that's not a whole
number of GB these days?


I'd say the vast majority of postgres instances in production run with less
than 1GB of s_b. Just because numbers wise the majority of instances are
running on small VMs and/or many PG instances are running on one larger
machine.  There are a lot of instances where the total available memory is
less than 2GB.


I mean I bet it happens, but in practice if you rounded to the nearest GB,
or even the nearest 2GB, I bet almost nobody would really care. I think it's
fine to be opinionated here and hold the line at a relatively large granule,
even though in theory people could want something else.


I don't believe that at all unfortunately.


Couldn't we scale the rounding, e.g. allow small allocations as we do 
now, but above some number always round? E.g. maybe >= 2GB round to the 
nearest 256MB, >= 4GB round to the nearest 512MB, >= 8GB round to the 
nearest 1GB, etc?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Replace current implementations in crypt() and gen_salt() to OpenSSL

2024-02-16 Thread Joe Conway

On 2/16/24 04:16, Daniel Gustafsson wrote:

On 15 Feb 2024, at 16:49, Peter Eisentraut  wrote:



1. All the block ciphers currently supported by crypt() and gen_salt() are not 
FIPS-compliant.

2. The crypt() and gen_salt() methods built on top of them (modes of operation, 
kind of) are not FIPS-compliant.


I wonder if it's worth trying to make pgcrypto disallow non-FIPS compliant
ciphers when the OpenSSL it is compiled against is running in FIPS mode, or
raise a WARNING when used?  It seems rather unlikely that someone running
OpenSSL with FIPS=yes wants to use our DES cipher without there being an error
or misconfiguration somewhere.

Something like the below untested pseudocode.

diff --git a/contrib/pgcrypto/pgcrypto.c b/contrib/pgcrypto/pgcrypto.c
index 96447c5757..3d4391ebe1 100644
--- a/contrib/pgcrypto/pgcrypto.c
+++ b/contrib/pgcrypto/pgcrypto.c
@@ -187,6 +187,14 @@ pg_crypt(PG_FUNCTION_ARGS)
   *resbuf;
text   *res;
  
+#if defined FIPS_mode
+	if (FIPS_mode())
+#else
+	if (EVP_default_properties_is_fips_enabled(OSSL_LIB_CTX_get0_global_default()))
+#endif
+		ereport(ERROR,
+				(errmsg("not available when using OpenSSL in FIPS mode")));
+


Makes sense +1

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: pgjdbc is not working with PKCS8 certificates with password

2024-02-07 Thread Joe Conway

On 2/7/24 06:42, just madhu wrote:

On further investigation,
With the certificate generated as below, the JDBC connection is successful:
openssl pkcs8 -topk8 -inform PEM -in client.key -outform DER -out 
client.pk8 -passout pass:foobar -v1 PBE-MD5-DES


But a connection from pgAdmin (connection failed: 
\SSLCerts\pk8_pass\client_pass_PBE.pk8": no start line) and psql (psql: 
error: could not load private key file "client_pass_PBE.pk8": 
unsupported) is failing.


Is there a common way in which certificates with passwords can be 
created for both libpq and JDBC?



You may want to check with the pgjdbc project on github rather than (or 
in addition to?) here; see:


  https://github.com/pgjdbc/pgjdbc/issues

Joe

On Wed, Feb 7, 2024 at 3:17 PM just madhu  wrote:


Hi ,

postgresql-42.7.1.jar

Trying to use establish a connection using PKCS8 certificate created
with password.

openssl pkcs8 -topk8 -inform PEM -in client.key -outform DER -out
client.pk8 -passout pass:foobar

I set the properties as below:
...
sslProperties.setProperty("sslkey", "client.pk8");
sslProperties.setProperty("sslpassword","foobar");
...
Connection connection = DriverManager.getConnection(jdbcUrl,
sslProperties);

This is failing with the error:
org.postgresql.util.PSQLException: SSL error: Connection reset
at org.postgresql.ssl.MakeSSL.convert(MakeSSL.java:43)
at

org.postgresql.core.v3.ConnectionFactoryImpl.enableSSL(ConnectionFactoryImpl.java:584)
at

org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:168)
...

Regards,
Madhu



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: [17] CREATE SUBSCRIPTION ... SERVER

2024-01-15 Thread Joe Conway

On 1/12/24 20:17, Jeff Davis wrote:

On Fri, 2024-01-05 at 16:11 +0530, Ashutosh Bapat wrote:

I don't think we need to add a test for every FDW. E.g. adding a test
in file_fdw would be pointless. But postgres_fdw is special. The test
could further create a foreign table ftab_foo on subscriber
referencing foo on publisher and then compare the data from foo and
ftab_foo to make sure that the replication is happening. This will
serve as a good starting point for replicated tables setup in a
sharded cluster.



Attached updated patch set with added TAP test for postgres_fdw, which
uses a postgres_fdw server as the source for subscription connection
information.

I think 0004 needs a bit more work, so I'm leaving it off for now, but
I'll bring it back in the next patch set.


I took a quick scan through the patch. The only thing that jumped out at 
me was that it seems like it might make sense to use 
quote_literal_cstr() rather than defining your own appendEscapedValue() 
function?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: System username in pg_stat_activity

2024-01-10 Thread Joe Conway

On 1/10/24 08:59, Magnus Hagander wrote:

On Wed, Jan 10, 2024 at 2:56 PM Bertrand Drouvot wrote:

I think it depends what we want the new field to reflect. If it is the exact
same thing as the SYSTEM_USER then I think it has to be text (as the SYSTEM_USER
is made of "auth_method:identity"). Now if we want it to be "only" the identity
part of it, then the `name` datatype would be fine. I'd vote for the exact same
thing as the SYSTEM_USER (means auth_method:identity).


I definitely think it should be the same. If it's not exactly the
same, then it should be *two* columns, one with auth method and one
with the name.

And thinking more about it maybe that's cleaner, because that makes it
easier to do things like filter based on auth method?


+1 for the overall feature and +1 for two columns


> + 
> +  
> +   authname name
> +  
> +  
> +   The authentication method and identity (if any) that the user
> +   used to log in. It contains the same value as
> +returns in the backend.
> +  
> + 

I'm fine with auth_method:identity.

> +S.authname,

What about using system_user as the field name? (because if we keep
auth_method:identity it's not really the authname anyway).


I was worried system_user or sysuser would both be confusing with the
fact that we have usesysid -- which would reference a *different*
sys...



I think if it is exactly "system_user" it would be pretty clearly a 
match for SYSTEM_USER



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2024-01-08 Thread Joe Conway

On 1/8/24 14:36, Dean Rasheed wrote:

On Thu, 7 Dec 2023 at 01:10, Joe Conway  wrote:


The attached should fix the CopyOut response to say one column.



Playing around with this, I found a couple of cases that generate an error:

COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);

COPY (VALUES (1), (2)) TO stdout WITH (format json);

both of those generate the following:

ERROR:  record type has not been registered



Thanks -- will have a look

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Password leakage avoidance

2024-01-07 Thread Joe Conway

On 1/6/24 15:10, Tom Lane wrote:

Joe Conway  writes:
The only code specific comments were Tom's above, which have been 
addressed. If there are no serious objections I plan to commit this 
relatively soon.


I had not actually read this patchset before, but now I have, and
I have a few minor suggestions:


Many thanks!


* The API comment for PQchangePassword should specify that encryption
is done according to the server's password_encryption setting, and
probably the SGML docs should too.  You shouldn't have to read the
code to discover that.


Check


* I don't especially care for the useless initializations of
encrypted_password, fmtpw, and fmtuser.  In all those cases the
initial NULL is immediately replaced by a valid value.  Probably
the compiler will figure out that the initializations are useless,
but human readers have to do so as well.  Moreover, I think this
style is more bug-prone not less so, because if you ever change
the logic in a way that causes some code paths to fail to set
the variables, you won't get use-of-possibly-uninitialized-value
warnings from the compiler.

* Perhaps move the declaration of "buf" to the inner block where
it's actually used?


Makes sense -- fixed


* This could be shortened to just "return res":

+   if (!res)
+   return NULL;
+   else
+   return res;


Heh, apparently I needed more coffee at this point :-)


* I'd make the SGML documentation a bit more explicit about the
return value, say

+   Returns a PGresult pointer representing
+   the result of the ALTER USER command, or
+   a null pointer if the routine failed before issuing any command.


Fixed.

I also ran pgindent. I was kind of surprised/unhappy when it made me 
change this (which aligned the two var names):

8<-------------------------------
PQExpBufferData buf;
PGresult       *res;
8<-------------------------------

to this (which leaves the var names unaligned):
8<-------------------------------
PQExpBufferData buf;
PGresult   *res;
8<-------------------------------

Anyway, the resulting adjustments attached.
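
For anyone wanting to try it out, a minimal usage sketch based on the
signature in the attached patch (connection setup and password prompting
omitted; this is illustrative, not part of the patch itself):

8<-------------------------------
#include <stdio.h>
#include <libpq-fe.h>

static int
change_password(PGconn *conn, const char *user, const char *new_password)
{
	PGresult   *res = PQchangePassword(conn, user, new_password);
	int			ok = (res != NULL &&
					  PQresultStatus(res) == PGRES_COMMAND_OK);

	if (!ok)
		fprintf(stderr, "password change failed: %s", PQerrorMessage(conn));

	PQclear(res);				/* PQclear() is a no-op on NULL */
	return ok;
}
8<-------------------------------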

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 850734a..28b861f 100644
*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** PQclosePrepared   188
*** 191,193 
--- 191,194 
  PQclosePortal 189
  PQsendClosePrepared   190
  PQsendClosePortal 191
+ PQchangePassword  192
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 912aa14..dcb7c9f 100644
*** a/src/interfaces/libpq/fe-auth.c
--- b/src/interfaces/libpq/fe-auth.c
*** PQencryptPasswordConn(PGconn *conn, cons
*** 1372,1374 
--- 1372,1455 
  
  	return crypt_pwd;
  }
+ 
+ /*
+  * PQchangePassword -- exported routine to change a password
+  *
+  * This is intended to be used by client applications that wish to
+  * change the password for a user. The password is not sent in
+  * cleartext because it is encrypted on the client side. This is
+  * good because it ensures the cleartext password is never known by
+  * the server, and therefore won't end up in logs, pg_stat displays,
+  * etc. The password encryption is performed by PQencryptPasswordConn(),
+  * which is passed a NULL for the algorithm argument. Hence encryption
+  * is done according to the server's password_encryption
+  * setting. We export the function so that clients won't be dependent
+  * on the implementation specific details with respect to how the
+  * server changes passwords.
+  *
+  * Arguments are a connection object, the SQL name of the target user,
+  * and the cleartext password.
+  *
+  * Return value is the PGresult of the executed ALTER USER statement
+  * or NULL if we never get there. The caller is responsible to PQclear()
+  * the returned PGresult.
+  *
+  * PQresultStatus() should be called to check the return value for errors,
+  * and PQerrorMessage() used to get more information about such errors.
+  */
+ PGresult *
+ PQchangePassword(PGconn *conn, const char *user, const char *passwd)
+ {
+ 	char	   *encrypted_password = PQencryptPasswordConn(conn, passwd,
+ 		   user, NULL);
+ 
+ 	if (!encrypted_password)
+ 	{
+ 		/* PQencryptPasswordConn() already registered the error */
+ 		return NULL;
+ 	}
+ 	else
+ 	{
+ 		char	   *fmtpw = PQescapeLiteral(conn, encrypted_password,
+ 			strlen(encrypted_password));
+ 
+ 		/* no longer needed, so clean up now */
+ 		PQfreemem(encrypted_password);
+ 
+ 		if (!fmtpw)
+ 		{
+ 			/* PQescapeLiteral() already registered the error */
+ 			return NULL;
+ 		}
+ 		else
+ 		{
+ 			char	   *fmtuser = PQescapeIdentifier(conn, user, strlen(user));
+ 
+ 			if (!fmtuser)
+ 			{
+ /* PQescapeIdentifier() already registered th

Re: Password leakage avoidance

2024-01-06 Thread Joe Conway

On 1/6/24 13:16, Sehrope Sarkuni wrote:
On Sat, Jan 6, 2024 at 12:39 PM Joe Conway  wrote:


The only code specific comments were Tom's above, which have been
addressed. If there are no serious objections I plan to commit this
relatively soon.


One more thing that we do in pgjdbc is to zero out the input password 
args so that they don't remain in memory even after being freed. It's 
kind of odd in Java as it makes the input interface a char[] and we have 
to convert them to garbage collected Strings internally (which kind of 
defeats the purpose of the exercise).


But in libpq could be done via something like:

memset(pw1, 0, strlen(pw1));
memset(pw2, 0, strlen(pw2));



That part is in psql not libpq

There was some debate on our end of where to do that and we settled on 
doing it inside the encoding functions to ensure it always happens. So 
the input password char[] always gets wiped regardless of how the 
encoding functions are invoked.


Even if it's not added to the password encoding functions (as that kind 
of changes the after effects if anything was relying on the password 
still having the password), I think it'd be good to add it to the 
command.c stuff that has the two copies of the password prior to freeing 
them.


While that change might or might not be worthwhile, I see it as 
independent of this patch.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Password leakage avoidance

2024-01-06 Thread Joe Conway

On 12/24/23 10:14, Joe Conway wrote:

On 12/23/23 11:00, Tom Lane wrote:

Joe Conway  writes:
The attached patch set moves the guts of \password from psql into the 
libpq client side -- PQchangePassword() (patch 0001).


Haven't really read the patch, just looked at the docs, but here's
a bit of bikeshedding:


Thanks!


* This seems way too eager to promote the use of md5.  Surely the
default ought to be SCRAM, full stop.  I question whether we even
need an algorithm parameter.  Perhaps it's a good idea for
future-proofing, but we could also plan that the function would
make its own decisions based on noting the server's version.
(libpq is far more likely to be au courant about what to do than
the calling application, IMO.)



* Parameter order seems a bit odd: to me it'd be natural to write
user before password.



* Docs claim return type is char *, but then say bool (which is
also what the code says).  We do not use bool in libpq's API;
the admittedly-hoary convention is to use int with 1/0 values.
Rather than quibble about that, though, I wonder if we should
make the result be the PGresult from the command, which the
caller would be responsible to free.  That would document error
conditions much more flexibly.  On the downside, client-side
errors would be harder to report, particularly OOM, but I think
we have solutions for that in existing functions.



* The whole business of nonblock mode seems fairly messy here,
and I wonder whether it's worth the trouble to support.  If we
do want to claim it works then it ought to be tested.


All of these (except for the doc "char *" cut-n-pasteo) were due to
trying to stay close to the same interface as PQencryptPasswordConn().

But I agree with your assessment and the attached patch series addresses
all of them.

The original use of PQencryptPasswordConn() in psql passed a NULL for
the algorithm, so I dropped that argument entirely. I also swapped user
and password arguments because as you pointed out that is more natural.

This version returns PGresult. As far as special handling for
client-side errors like OOM, I don't see anything beyond returning a
NULL to signify fatal error, e,g,:

8<--
PGresult *
PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
{
PGresult   *result;

result = (PGresult *) malloc(sizeof(PGresult));
if (!result)
return NULL;
8<--

That is the approach I took.

One thing I have not done, but considered, is adding an additional 
optional parameter to allow "VALID UNTIL" to be set. Seems like it would 
be useful to be able to set an expiration when setting a new password.


No strong opinion about that.


Thanks -- hopefully others will weigh in on that.



I just read through the thread and my conclusion is that, specifically 
related to this patch set (i.e. notwithstanding suggestions for related 
features), there is consensus in favor of adding this feature.


The only code specific comments were Tom's above, which have been 
addressed. If there are no serious objections I plan to commit this 
relatively soon.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 850734a..28b861f 100644
*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** PQclosePrepared   188
*** 191,193 
--- 191,194 
  PQclosePortal 189
  PQsendClosePrepared   190
  PQsendClosePortal 191
+ PQchangePassword  192
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 912aa14..a6897f8 100644
*** a/src/interfaces/libpq/fe-auth.c
--- b/src/interfaces/libpq/fe-auth.c
*** PQencryptPasswordConn(PGconn *conn, cons
*** 1372,1374 
--- 1372,1459 
  
  	return crypt_pwd;
  }
+ 
+ /*
+  * PQchangePassword -- exported routine to change a password
+  *
+  * This is intended to be used by client applications that wish to
+  * change the password for a user. The password is not sent in
+  * cleartext because it is encrypted on the client side. This is
+  * good because it ensures the cleartext password is never known by
+  * the server, and therefore won't end up in logs, pg_stat displays,
+  * etc. We export the function so that clients won't be dependent
+  * on the implementation specific details with respect to how the
+  * server changes passwords.
+  *
+  * Arguments are a connection object, the SQL name of the target user,
+  * and the cleartext password.
+  *
+  * Return value is the PGresult of the executed ALTER USER statement
+  * or NULL if we never get there. The caller is responsible to PQclear()
+  * the returned PGresult.
+  * 
+  * PQresultStatus() should be called to check the return value for errors,
+  * and PQerrorMessage() used to get more information about such errors.
+  */
+ PGr

Re: Password leakage avoidance

2024-01-06 Thread Joe Conway

On 1/2/24 07:23, Sehrope Sarkuni wrote:
1. There's two sets of defaults, the client program's default and the 
server's default. Need to pick one for each implemented function. They 
don't need to be the same across the board.


Is there a concrete recommendation here?

2. Password encoding should be split out and made available as its own 
functions. Not just as part of a wider "create _or_ alter a user's 
password" function attached to a connection.


It already is split out in libpq[1], unless I don't understand you 
correctly.


[1] 
https://www.postgresql.org/docs/current/libpq-misc.html#LIBPQ-PQENCRYPTPASSWORDCONN



We went a step further and  added an intermediate function that
generates the "ALTER USER ... PASSWORD" SQL.


I don't think we want that in libpq, but in any case separate 
patch/discussion IMHO.


3. We only added an "ALTER USER ... PASSWORD" function, not anything to 
create a user. There's way too many options for that and keeping this 
targeted at just assigning passwords makes it much easier to test.


+1

Also separate patch/discussion, but I don't think the CREATE version is 
needed.


4. RE:defaults, the PGJDBC approach is that the encoding-only function 
uses the driver's default (SCRAM). The "alterUserPassword(...)" uses the 
server's default (again usually SCRAM for modern installs but could be 
something else). It's kind of odd that they're different but the use 
cases are different as well.


Since PQencryptPasswordConn() already exists, and psql's "\password" 
used it with its defaults, I don't think we want to change the behavior. 
The patch as written behaves in a backward compatible way.


5. Our SCRAM specific function allows for customizing the algo iteration 
and salt parameters. That topic came up on hackers previously[1]. Our 
high level "alterUserPassword(...)" function does not have those options 
but it is available as part of our PasswordUtil SCRAM API for advanced 
users who want to leverage it. The higher level functions have defaults 
for iteration counts (4096) and salt size (16-bytes).


Again separate patch/discussion, IMHO.

6. Getting the verbiage right for the PGJDBC version was kind of 
annoying as we wanted to match the server's wording, e.g. 
"password_encryption", but it's clearly hashing, not encryption. We 
settled on "password encoding" for describing the overall task with the 
comments referencing the server's usage of the term 
"password_encryption". Found a similar topic[2] on changing that 
recently as well but looks like that's not going anywhere.


Really this is irrelevant to this discussion, because the new function 
is called PQchangePassword().


The function PQencryptPasswordConn() has been around for a while and the 
horse is out of the gate on that one.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG

2024-01-05 Thread Joe Conway

On 1/5/24 12:56, Robert Haas wrote:

On Sun, Aug 27, 2023 at 4:25 PM Heikki Linnakangas  wrote:

I think you got that backwards. 'perl_locale_obj' is set to the perl
interpreter's locale, whenever we are *outside* the interpreter.


This thread has had no update for more than 4 months, so I'm marking
the CF entry RwF for now.

It can always be reopened, if Joe or Tristan or Heikki or someone else
picks it up again.



It is definitely a bug, so I do plan to get back to it at some point, 
hopefully sooner rather than later...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: SET ROLE x NO RESET

2023-12-31 Thread Joe Conway

On 12/30/23 17:19, Michał Kłeczek wrote:



On 30 Dec 2023, at 17:16, Eric Hanson  wrote:

What do you think of adding a NO RESET option to the SET ROLE command?


What I proposed some time ago is SET ROLE … GUARDED BY ‘password’, so 
that you could later: RESET ROLE WITH ‘password'


I like that too, but see it as a separate feature. FWIW that is also 
supported by the set_user extension referenced elsewhere on this thread.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Build versionless .so for Android

2023-12-31 Thread Joe Conway

On 12/31/23 01:24, Michael Paquier wrote:

On Sun, Dec 31, 2023 at 06:08:04AM +0100, Matthias Kuhn wrote:

I was wondering if there are a) any comments on the approach and if I
should be handed in for a commitfest (currently waiting for the cooldown
period after account activation, I am not sure how long that is)


Such discussions are adapted in a commit fest entry.  No idea if there
is a cooldown period after account creation before being able to touch
the CF app contents, though.



FWIW I have expedited the cooldown period, so Matthias can log in right 
away.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: SET ROLE x NO RESET

2023-12-30 Thread Joe Conway

On 12/30/23 11:16, Eric Hanson wrote:

Hi,

What do you think of adding a NO RESET option to the SET ROLE command?

Right now Postgres can enforce data security with roles and RLS, but 
role-per-end-user doesn't really scale:  Db connections are per-role, so 
a connection pooler can't share connections across users.  We can work 
around this with policies that use session variables and checks against 
current_user, but it seems like role-per end user would be more 
beautiful.  If SET ROLE had a NO RESET option, you could connect through 
a connection pooler as a privileged user, but downgrade to the user's 
role for the duration of the session.


+1

I agree this would be useful.

In the meantime, in case it helps, see

  https://github.com/pgaudit/set_user

Specifically set_session_auth(text):
-
When set_session_auth(text) is called, the effective session and current 
user is switched to the rolename supplied, irrevocably. Unlike 
set_user() or set_user_u(), it does not affect logging nor allowed 
statements. If set_user.exit_on_error is "on" (the default), and any 
error occurs during execution, a FATAL error is thrown and the backend 
session exits.

-----

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Password leakage avoidance

2023-12-27 Thread Joe Conway

On 12/27/23 16:09, Tom Lane wrote:

Joe Conway  writes:

On 12/27/23 15:39, Peter Eisentraut wrote:

On 23.12.23 16:13, Joe Conway wrote:
The attached patch set moves the guts of \password from psql into the 
libpq client side -- PQchangePassword() (patch 0001).



I don't follow how you get from the problem statement to this solution.
This proposal doesn't avoid password leakage, does it?
It just provides a different way to phrase the existing solution.


Yes, a fully built one that is convenient to use, and does not ask 
everyone to roll their own.


It's convenient for users of libpq, I guess, but it doesn't help
anyone not writing C code directly atop libpq.  If this is the
way forward then we need to also press JDBC and other client
libraries to implement comparable functionality.  That's within
the realm of sanity surely, and having a well-thought-through
reference implementation in libpq would help those authors.
So I don't think this is a strike against the patch; but the answer
to Peter's question has to be that this is just part of the solution.


As mentioned downthread by Dave Cramer, JDBC is already onboard.

And as Jonathan said in an adjacent part of the thread:

This should generally helpful, as it allows libpq-based drivers to
adopt this method and provide a safer mechanism for setting/changing
passwords! (which we should also promote once available).


Which is definitely something I have had in mind all along.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Password leakage avoidance

2023-12-27 Thread Joe Conway

On 12/27/23 15:39, Peter Eisentraut wrote:

On 23.12.23 16:13, Joe Conway wrote:
I have recently, once again for the umpteenth time, been involved in 
discussions around (paraphrasing) "why does Postgres leak the passwords 
into the logs when they are changed". I know well that the canonical 
advice is something like "use psql with \password if you care about that".


And while that works, it is a deeply unsatisfying answer for me to give 
and for the OP to receive.


The alternative is something like "...well if you don't like that, use 
PQencryptPasswordConn() to roll your own solution that meets your 
security needs".


Again, not a spectacular answer IMHO. It amounts to "here is a 
do-it-yourself kit, go put it together". It occurred to me that we can, 
and really should, do better.


The attached patch set moves the guts of \password from psql into the 
libpq client side -- PQchangePassword() (patch 0001).


The usage in psql serves as a ready built-in test for the libpq function 
(patch 0002). Docs included too (patch 0003).


I don't follow how you get from the problem statement to this solution.
This proposal doesn't avoid password leakage, does it?


Yes, it most certainly does. The plaintext password would never be seen 
by the server and therefore never logged. This is exactly why the 
feature already existed in psql.



 It just provides a different way to phrase the existing solution.


Yes, a fully built one that is convenient to use, and does not ask 
everyone to roll their own.


Who is a potential user of this solution? 


Literally every company that has complained that Postgres pollutes their 
logs with plaintext passwords. I have heard the request to provide a 
better solution many times, over many years, while working for three 
different companies.



Right now it just saves a dozen lines in psql, but it's not clear how
it improves anything else.


It is to me, and so far no one else has complained about that. More 
opinions would be welcomed of course.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Password leakage avoidance

2023-12-24 Thread Joe Conway

On 12/24/23 12:22, Tom Lane wrote:

Joe Conway  writes:

Completely unrelated process bikeshedding:
I changed the naming scheme I used for the split patch-set this time. I 
don't know if we have a settled/documented pattern for such naming, but 
the original pattern which I borrowed from someone else's patches was 
"vX--description.patch".


As far as that goes, that filename pattern is what is generated by
"git format-patch".  I agree that the digit-count choices are a tad
odd, but they're not so awful as to be worth trying to override.



Ah, knew it was something like that. I am still a curmudgeon doing 
things the old way ¯\_(ツ)_/¯



The new pattern I picked is "description-vXXX-NN.patch" which fixes all 
of those issues.


Only if you use the same "description" for all patches of a series,
which seems kind of not the point.  In any case, "git format-patch"
is considered best practice for a multi-patch series AFAIK, so we
have to cope with its ideas about how to name the files.
Even if I wanted some differentiating name for the individual patches in 
a set, I still like them to be grouped because it is one unit of work 
from my perspective.


Oh well, I guess I will get with the program and put every patch-set 
into its own directory.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Password leakage avoidance

2023-12-24 Thread Joe Conway

On 12/23/23 11:00, Tom Lane wrote:

Joe Conway  writes:
The attached patch set moves the guts of \password from psql into the 
libpq client side -- PQchangePassword() (patch 0001).


Haven't really read the patch, just looked at the docs, but here's
a bit of bikeshedding:


Thanks!


* This seems way too eager to promote the use of md5.  Surely the
default ought to be SCRAM, full stop.  I question whether we even
need an algorithm parameter.  Perhaps it's a good idea for
future-proofing, but we could also plan that the function would
make its own decisions based on noting the server's version.
(libpq is far more likely to be au courant about what to do than
the calling application, IMO.)



* Parameter order seems a bit odd: to me it'd be natural to write
user before password.



* Docs claim return type is char *, but then say bool (which is
also what the code says).  We do not use bool in libpq's API;
the admittedly-hoary convention is to use int with 1/0 values.
Rather than quibble about that, though, I wonder if we should
make the result be the PGresult from the command, which the
caller would be responsible to free.  That would document error
conditions much more flexibly.  On the downside, client-side
errors would be harder to report, particularly OOM, but I think
we have solutions for that in existing functions.



* The whole business of nonblock mode seems fairly messy here,
and I wonder whether it's worth the trouble to support.  If we
do want to claim it works then it ought to be tested.


All of these (except for the doc "char *" cut-n-pasteo) were due to 
trying to stay close to the same interface as PQencryptPasswordConn().


But I agree with your assessment and the attached patch series addresses 
all of them.


The original use of PQencryptPasswordConn() in psql passed a NULL for 
the algorithm, so I dropped that argument entirely. I also swapped user 
and password arguments because as you pointed out that is more natural.


This version returns PGresult. As far as special handling for 
client-side errors like OOM, I don't see anything beyond returning a 
NULL to signify fatal error, e.g.:


8<--
PGresult *
PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
{
PGresult   *result;

result = (PGresult *) malloc(sizeof(PGresult));
if (!result)
return NULL;
8<--

That is the approach I took.
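
For what it's worth, from the caller's side the happy path then looks 
roughly like this (just a sketch, not part of the patch; "alice" and the 
new password are made up, and error handling is abbreviated):

8<--
#include <stdio.h>
#include "libpq-fe.h"

static void
change_password(PGconn *conn)
{
    PGresult   *res;

    /* user name first, then the new cleartext password */
    res = PQchangePassword(conn, "alice", "new-secret");
    if (res == NULL || PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "password change failed: %s", PQerrorMessage(conn));
    PQclear(res);               /* PQclear() is a no-op on NULL */
}
8<--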

One thing I have not done, but considered, is adding an additional 
optional parameter to allow "VALID UNTIL" to be set. Seems like it would 
be useful to be able to set an expiration when setting a new password.


No strong opinion about that.


Thanks -- hopefully others will weigh in on that.

Completely unrelated process bikeshedding:
I changed the naming scheme I used for the split patch-set this time. I 
don't know if we have a settled/documented pattern for such naming, but 
the original pattern which I borrowed from someone else's patches was 
"vX--description.patch".


The problems I have with that are 1/ there may well be more than 10 
versions of a patch-set, 2/ there are probably never going to be more 
than 2 digits worth of files in a patch-set, and 3/ the description 
coming after the version and file identifiers causes the patches in my 
local directory to sort poorly, intermingling several unrelated patch-sets.


The new pattern I picked is "description-vXXX-NN.patch" which fixes all 
of those issues. Does that bother anyone? *Should* we try to agree on a 
desired pattern (assuming there is not one already)?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 850734a..28b861f 100644
*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** PQclosePrepared   188
*** 191,193 
--- 191,194 
  PQclosePortal 189
  PQsendClosePrepared   190
  PQsendClosePortal 191
+ PQchangePassword  192
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 912aa14..a6897f8 100644
*** a/src/interfaces/libpq/fe-auth.c
--- b/src/interfaces/libpq/fe-auth.c
*** PQencryptPasswordConn(PGconn *conn, cons
*** 1372,1374 
--- 1372,1459 
  
  	return crypt_pwd;
  }
+ 
+ /*
+  * PQchangePassword -- exported routine to change a password
+  *
+  * This is intended to be used by client applications that wish to
+  * change the password for a user. The password is not sent in
+  * cleartext because it is encrypted on the client side. This is
+  * good because it ensures the cleartext password is never known by
+  * the server, and therefore won't end up in logs, pg_stat displays,
+  * etc. We export the function so that clients won't be dependent
+  * on the implementation specific details with respect to how the
+  *

Password leakage avoidance

2023-12-23 Thread Joe Conway
I have recently, once again for the umpteenth time, been involved in 
discussions around (paraphrasing) "why does Postgres leak the passwords 
into the logs when they are changed". I know well that the canonical 
advice is something like "use psql with \password if you care about that".


And while that works, it is a deeply unsatisfying answer for me to give 
and for the OP to receive.


The alternative is something like "...well if you don't like that, use 
PQencryptPasswordConn() to roll your own solution that meets your 
security needs".


Again, not a spectacular answer IMHO. It amounts to "here is a 
do-it-yourself kit, go put it together". It occurred to me that we can, 
and really should, do better.


The attached patch set moves the guts of \password from psql into the 
libpq client side -- PQchangePassword() (patch 0001).


The usage in psql serves as a ready built-in test for the libpq function 
(patch 0002). Docs included too (patch 0003).


One thing I have not done, but considered, is adding an additional 
optional parameter to allow "VALID UNTIL" to be set. Seems like it would 
be useful to be able to set an expiration when setting a new password.


I will register this in the upcoming commitfest, but meantime 
thought/comments/etc. would be gratefully received.


Thanks,

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 850734a..28b861f 100644
*** a/src/interfaces/libpq/exports.txt
--- b/src/interfaces/libpq/exports.txt
*** PQclosePrepared   188
*** 191,193 
--- 191,194 
  PQclosePortal 189
  PQsendClosePrepared   190
  PQsendClosePortal 191
+ PQchangePassword  192
diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c
index 912aa14..3ee67ba 100644
*** a/src/interfaces/libpq/fe-auth.c
--- b/src/interfaces/libpq/fe-auth.c
*** PQencryptPasswordConn(PGconn *conn, cons
*** 1372,1374 
--- 1372,1463 
  
  	return crypt_pwd;
  }
+ 
+ /*
+  * PQchangePassword -- exported routine to change a password
+  *
+  * This is intended to be used by client applications that wish to
+  * change the password for a user.  The password is not sent in
+  * cleartext because it is encrypted on the client side. This is
+  * good because it ensures the cleartext password is never known by
+  * the server, and therefore won't end up in logs, pg_stat displays,
+  * etc. We export the function so that clients won't be dependent
+  * on the implementation specific details with respect to how the
+  * server changes passwords.
+  *
+  * Arguments are a connection object, the cleartext password, the SQL
+  * name of the user it is for, and a string indicating the algorithm to
+  * use for encrypting the password.  If algorithm is NULL,
+  * PQencryptPasswordConn() queries the server for the current
+  * 'password_encryption' value. If you wish to avoid that, e.g. to avoid
+  * blocking, you can execute 'show password_encryption' yourself before
+  * calling this function, and pass it as the algorithm.
+  *
+  * Return value is a boolean, true for success. On error, an error message
+  * is stored in the connection object, and returns false.
+  */
+ bool
+ PQchangePassword(PGconn *conn, const char *passwd, const char *user,
+  const char *algorithm)
+ {
+ 	char		   *encrypted_password = NULL;
+ 	PQExpBufferData buf;
+ 	bool			success = true;
+ 
+ 	encrypted_password = PQencryptPasswordConn(conn, passwd, user, algorithm);
+ 
+ 	if (!encrypted_password)
+ 	{
+ 		/* PQencryptPasswordConn() already registered the error */
+ 		success = false;
+ 	}
+ 	else
+ 	{
+ 		char	   *fmtpw = NULL;
+ 
+ 		fmtpw = PQescapeLiteral(conn, encrypted_password,
+ strlen(encrypted_password));
+ 
+ 		/* no longer needed, so clean up now */
+ 		PQfreemem(encrypted_password);
+ 
+ 		if (!fmtpw)
+ 		{
+ 			/* PQescapeLiteral() already registered the error */
+ 			success = false;
+ 		}
+ 		else
+ 		{
+ 			char	   *fmtuser = NULL;
+ 
+ 			fmtuser = PQescapeIdentifier(conn, user, strlen(user));
+ 			if (!fmtuser)
+ 			{
+ /* PQescapeIdentifier() already registered the error */
+ success = false;
+ PQfreemem(fmtpw);
+ 			}
+ 			else
+ 			{
+ PGresult   *res;
+ 
+ initPQExpBuffer();
+ printfPQExpBuffer(, "ALTER USER %s PASSWORD %s",
+   fmtuser, fmtpw);
+ 
+ res = PQexec(conn, buf.data);
+ if (!res)
+ 	success = false;
+ else
+ 	PQclear(res);
+ 
+ /* clean up */
+ termPQExpBuffer();
+ PQfreemem(fmtuser);
+ PQfreemem(fmtpw);
+ 			}
+ 		}
+ 	}
+ 
+ 	return success;
+ }
diff --git a/src/interfaces/libpq/libpq-fe.h b/src/interfaces/libpq/libpq-fe.h
index 97762d5..f6e7207 100644
*** a/src/interfaces/libpq/libpq-fe.h
--- b/src/interfaces/libpq/libpq-fe.h
*** e

Re: Eager page freeze criteria clarification

2023-12-21 Thread Joe Conway

On 12/21/23 10:56, Melanie Plageman wrote:

On Sat, Dec 9, 2023 at 9:24 AM Joe Conway  wrote:

However, even if we assume a more-or-less normal distribution, we should
consider using subgroups in a way similar to Statistical Process
Control[1]. The reasoning is explained in this quote:

 The Math Behind Subgroup Size

 The Central Limit Theorem (CLT) plays a pivotal role here. According
 to CLT, as the subgroup size (n) increases, the distribution of the
 sample means will approximate a normal distribution, regardless of
 the shape of the population distribution. Therefore, as your
 subgroup size increases, your control chart limits will narrow,
 making the chart more sensitive to special cause variation and more
 prone to false alarms.


I haven't read anything about statistical process control until you
mentioned this. I read the link you sent and also googled around a
bit. I was under the impression that the more samples we have, the
better. But, it seems like this may not be the assumption in
statistical process control?

It may help us to get more specific. I'm not sure what the
relationship between "unsets" in my code and subgroup members would
be.  The article you linked suggests that each subgroup should be of
size 5 or smaller. Translating that to my code, were you imagining
subgroups of "unsets" (each time we modify a page that was previously
all-visible)?


Basically, yes.

It might not make sense, but I think we could test the theory by 
plotting a histogram of the raw data, and then also plot a histogram 
based on sub-grouping every 5 sequential values in your accumulator.


If the former does not look very normal (I would guess most workloads it 
will be skewed with a long tail) and the latter looks to be more normal, 
then it would say we were on the right track.
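
The sub-grouping step itself is trivial -- something along these lines 
(sketch only; the names here are made up and not tied to the actual 
accumulator code in your patch):

8<--
#define SUBGROUP_SIZE 5

/*
 * Collapse raw measurements into subgroup means of size 5 and return
 * the number of subgroups produced; "means" must have room for
 * nraw / SUBGROUP_SIZE entries. Leftover values are simply ignored.
 */
static int
subgroup_means(const double *raw, int nraw, double *means)
{
    int     ngroups = nraw / SUBGROUP_SIZE;

    for (int g = 0; g < ngroups; g++)
    {
        double  sum = 0.0;

        for (int i = 0; i < SUBGROUP_SIZE; i++)
            sum += raw[g * SUBGROUP_SIZE + i];
        means[g] = sum / SUBGROUP_SIZE;
    }
    return ngroups;
}
8<--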


There are statistical tests for "normalness" that could be applied too 
( e.g. 
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6350423/#sec2-13title ) 
which would be a more rigorous approach, but the quick look at histograms 
might be sufficiently convincing.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Eager page freeze criteria clarification

2023-12-09 Thread Joe Conway
er
table vacuum, especially in case of long-running vacuums.

To benchmark this new heuristic (I'm calling it algo 6 since it is the
6th I've proposed on this thread), I've used a modified subset of my
original workloads:

Workloads

C. Shifting hot set
32 clients inserting multiple rows and then updating an indexed
column of a row on a different page containing data they formerly
inserted. Only recent data is updated.

H. Append only table
32 clients, each inserting a single row at a time

I. Work queue
32 clients, each inserting a row, then updating an indexed column in
2 different rows in nearby pages, then deleting 1 of the updated rows


Workload C:

Algo | Table| AVs | Page Freezes | Pages Frozen | % Frozen
-|--|-|--|--|-
6 | hot_tail |  14 |2,111,824 |1,643,217 |53.4%
M | hot_tail |  16 |  241,605 |3,921 | 0.1%


Algo | WAL GB | Cptr Bgwr Writes | Other r/w  | AV IO time |TPS
-||--|||
6 |193 |5,473,949 | 12,793,574 | 14,870 | 28,397
M |207 |5,213,763 | 20,362,315 | 46,190 | 28,461

Algorithm 6 freezes all of the cold data and doesn't freeze the current
working set. The notable thing is how much this reduces overall system
I/O. On master, autovacuum is doing more than 3x the I/O and the rest of
the system is doing more than 1.5x the I/O. I suspect freezing data when
it is initially vacuumed is saving future vacuums from having to evict
pages of the working set and read in cold data.


Workload H:

Algo | Table | AVs | Page Freezes | Pages Frozen | % frozen
-|---|-|--|--|-
6 | hthistory |  22 |  668,448 |  668,003 |  87%
M | hthistory |  22 |0 |0 |   0%


Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO time |TPS
-||--|---||
6 | 14 |  725,103 |   725,575 |  1 | 43,535
M | 13 |  693,945 |   694,417 |  1 | 43,522

The insert-only table is mostly frozen at the end. There is more I/O
done but not at the expense of TPS. This is exactly what we want.


Workload I:

Algo | Table | AVs | Page Freezes | Pages Frozen | % Frozen
-|---|-|--|--|-
6 | workqueue | 234 |0 |4,416 | 78%
M | workqueue | 234 |0 |4,799 | 87%


Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO Time |TPS
-||--|---||
6 | 74 |   64,345 |64,813 |  1 | 36,366
M | 73 |   63,468 |63,936 |  1 | 36,145

What we want is for the work queue table to freeze as little as
possible, because we will be constantly modifying the data. Both on
master and with algorithm 6 we do not freeze tuples on any pages. You
will notice, however, that much of the table is set frozen in the VM at
the end. This is because we set pages all frozen in the VM if they are
technically all frozen even if we do not freeze tuples on the page. This
is inexpensive and not under the control of the freeze heuristic.

Overall, the behavior of this new adaptive freezing method seems to be
exactly what we want. The next step is to decide how many values to
remove from the accumulator and benchmark cases where old data is
deleted.

I'd be delighted to receive any feedback, ideas, questions, or review.



This is well thought out, well described, and a fantastic improvement in 
my view -- well done!


I do think we will need to consider distributions other than normal, but 
I don't know offhand what they will be.


However, even if we assume a more-or-less normal distribution, we should 
consider using subgroups in a way similar to Statistical Process 
Control[1]. The reasoning is explained in this quote:


The Math Behind Subgroup Size

The Central Limit Theorem (CLT) plays a pivotal role here. According
to CLT, as the subgroup size (n) increases, the distribution of the
sample means will approximate a normal distribution, regardless of
the shape of the population distribution. Therefore, as your
subgroup size increases, your control chart limits will narrow,
making the chart more sensitive to special cause variation and more
prone to false alarms.


[1] 
https://www.qualitygurus.com/rational-subgrouping-in-statistical-process-control/


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-08 Thread Joe Conway

On 12/8/23 14:45, Daniel Verite wrote:

Joe Conway wrote:


copyto_json.007.diff


When the source has json fields with non-significant line feeds, the COPY
output has these line feeds too, which makes the output incompatible
with rule #2 at https://jsonlines.org  ("2. Each Line is a Valid JSON
Value").

create table j(f json);

insert into j values('{"a":1,
"b":2
}');

copy j to stdout (format json);

Result:
{"f":{"a":1,
"b":2
}}

Is that expected? copy.sgml in 007 doesn't describe the output
in terms of lines so it's hard to tell from the doc.


The patch as-is just does the equivalent of row_to_json():
8<
select row_to_json(j) from j;
 row_to_json
--
 {"f":{"a":1,+
 "b":2   +
 }}
(1 row)
8<

So yeah, that is how it works today. I will take a look at what it would 
take to fix it.



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Joe Conway

On 12/7/23 09:11, David G. Johnston wrote:
Those are all the same breakage though - if truly interpreted as data 
rows the protocol is basically written such that the array format is not 
supportable and only the lines format can be used.  Hence my “format 0 
doesn’t work” comment for array output and we should explicitly add 
format 2 where we explicitly decouple lines of output from rows of 
data.  That said, it would seem in practice format 0 already decouples 
them and so the current choice of the brackets on their own lines is 
acceptable.


I’d prefer to keep them on their own line.


WFM ¯\_(ツ)_/¯

I am merely responding with options to the many people opining on the 
thread.


I also don’t know why you introduced another level of object nesting 
here.  That seems quite undesirable.


I didn't add anything. It is an artifact of the particular query I wrote 
in the copy to statement (I did "select ss from ss" instead of "select * 
from ss"), mea culpa.


This is what the latest patch, as written today, outputs:
8<--
copy
(select 1, g.i from generate_series(1, 3) g(i))
to stdout (format json, force_array);
[
 {"?column?":1,"i":1}
,{"?column?":1,"i":2}
,{"?column?":1,"i":3}
]
8<--

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Joe Conway

On 12/7/23 08:52, Joe Conway wrote:

Or maybe this is preferred?
8<--
[{"ss":{"f1":1,"f2":1}},
   {"ss":{"f1":1,"f2":2}},
   {"ss":{"f1":1,"f2":3}}]
8<--


I don't know why my mail client keeps adding extra spaces, but the 
intention here is a single space in front of row 2 and 3 in order to 
line the json objects up at column 2.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Joe Conway

On 12/7/23 08:35, Daniel Verite wrote:

Joe Conway wrote:

The attached should fix the CopyOut response to say one column. I.e. it 
ought to look something like:


Spending more time with the doc I came to the opinion that in this bit
of the protocol, in CopyOutResponse (B)
...
Int16
The number of columns in the data to be copied (denoted N below).
...

this number must be the number of columns in the source.
That is, for COPY table(a,b,c) the number is 3, independently of
whether the result is formatted in text, csv, json or binary.

I think that changing it for json can reasonably be interpreted
as a protocol break and we should not do it.

The fact that this value does not help parsing the CopyData
messages that come next is not a new issue. A reader that
doesn't know the field separator and whether it's text or csv
cannot parse these messages into fields anyway.
But just knowing how many columns there are in the original
data might be useful by itself and we don't want to break that.


Ok, that sounds reasonable to me -- I will revert that change.


The other question for me is, in the CopyData message, this
bit:
" Messages sent from the backend will always correspond to single data rows"

ISTM that considering that the "[" starting the json array is a
"data row" is a stretch.
That might be interpreted as a protocol break, depending
on how strict the interpretation is.


If we really think that is a problem I can see about changing it to this 
format for json array:


8<--
copy
(
  with ss(f1, f2) as
  (
select 1, g.i from generate_series(1, 3) g(i)
  )
  select ss from ss
) to stdout (format json, force_array);
[{"ss":{"f1":1,"f2":1}}
,{"ss":{"f1":1,"f2":2}}
,{"ss":{"f1":1,"f2":3}}]
8<--

Is this acceptable to everyone?

Or maybe this is preferred?
8<--
[{"ss":{"f1":1,"f2":1}},
 {"ss":{"f1":1,"f2":2}},
 {"ss":{"f1":1,"f2":3}}]
8<--

Or as long as we are painting the shed, maybe this?
8<--
[{"ss":{"f1":1,"f2":1}},
{"ss":{"f1":1,"f2":2}},
{"ss":{"f1":1,"f2":3}}]
8<--

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Joe Conway

On 12/6/23 21:56, Nathan Bossart wrote:

On Wed, Dec 06, 2023 at 03:20:46PM -0500, Tom Lane wrote:

If Nathan's perf results hold up elsewhere, it seems like some
micro-optimization around the text-pushing (appendStringInfoString)
might be more useful than caching.  The 7% spent in cache lookups
could be worth going after later, but it's not the top of the list.


Hah, it turns out my benchmark of 110M integers really stresses the
JSONTYPE_NUMERIC path in datum_to_json_internal().  That particular path
calls strlen() twice: once for IsValidJsonNumber(), and once in
appendStringInfoString().  If I save the result from IsValidJsonNumber()
and give it to appendBinaryStringInfo() instead, the COPY goes ~8% faster.
It's probably worth giving datum_to_json_internal() a closer look in a new
thread.


Yep, after looking through that code I was going to make the point that 
your 110M integer test was over-indexing on that one type. I am sure there 
are other micro-optimizations to be made here, but I also think that it 
is outside the scope of the COPY TO JSON patch.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 20:09, David G. Johnston wrote:
On Wed, Dec 6, 2023 at 5:57 PM Joe Conway  wrote:


On 12/6/23 19:39, David G. Johnston wrote:
 > On Wed, Dec 6, 2023 at 4:45 PM Joe Conway  wrote:

 > But I still cannot shake the belief that using a format code of 1 -
 > which really could be interpreted as meaning "textual csv" in
practice -
 > for this JSON output is unwise and we should introduce a new integer
 > value for the new fundamental output format.

No, I am pretty sure you still have that wrong. The "1" means binary
mode


Ok.  I made the same typo twice, I did mean to write 0 instead of 1.


Fair enough.

But the point that we should introduce a 2 still stands.  The new code 
would mean: use text output functions but that there is no inherent 
tabular structure in the underlying contents.  Instead the copy format 
was JSON and the output layout is dependent upon the json options in the 
copy command and that there really shouldn't be any attempt to turn the 
contents directly into a tabular data structure like you presently do 
with the CSV data under format 0.  Ignore the column count and column 
formats as they are fixed or non-existent.


I think that amounts to a protocol change, which we tend to avoid at all 
costs.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 18:09, Joe Conway wrote:

On 12/6/23 14:47, Joe Conway wrote:

On 12/6/23 13:59, Daniel Verite wrote:

Andrew Dunstan wrote:

IMNSHO, we should produce either a single JSON 
document (the ARRAY case) or a series of JSON documents, one per row 
(the LINES case).


"COPY Operations" in the doc says:

" The backend sends a CopyOutResponse message to the frontend, followed
by zero or more CopyData messages (always one per row), followed by
CopyDone".

In the ARRAY case, the first messages with the copyjsontest
regression test look like this (tshark output):

PostgreSQL
 Type: CopyOut response
 Length: 13
 Format: Text (0)
 Columns: 3
Format: Text (0)
PostgreSQL
 Type: Copy data
 Length: 6
 Copy data: 5b0a
PostgreSQL
 Type: Copy data
 Length: 76
 Copy data:
207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…

The first Copy data message with contents "5b0a" does not qualify
as a row of data with 3 columns as advertised in the CopyOut
message. Isn't that a problem?



Is it a real problem, or just a bit of documentation change that I missed?

Anything receiving this and looking for a json array should know how to
assemble the data correctly despite the extra CopyData messages.


Hmm, maybe the real problem here is that Columns do not equal "3" for
the json mode case -- that should really say "1" I think, because the
row is not represented as 3 columns but rather 1 json object.

Does that sound correct?

Assuming yes, there may still be an issue that there are two more
"rows" than actual output rows (the "[" and the "]"), but maybe those
are less likely to cause some hazard?



The attached should fix the CopyOut response to say one column. I.e. it 
ought to look something like:


PostgreSQL
 Type: CopyOut response
 Length: 13
 Format: Text (0)
 Columns: 1
 Format: Text (0)
PostgreSQL
 Type: Copy data
 Length: 6
 Copy data: 5b0a
PostgreSQL
 Type: Copy data
 Length: 76
 Copy data: [...]
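
And to be clear, a libpq-based receiver does not really need to care 
about that column count at all -- it can just treat each CopyData 
payload as a chunk of text and reassemble the document, e.g. (sketch 
only, error handling abbreviated, "mytable" is a stand-in):

8<--
#include <stdio.h>
#include "libpq-fe.h"

static void
dump_copy_json(PGconn *conn)
{
    PGresult   *res;
    char       *buf;
    int         len;

    res = PQexec(conn, "COPY mytable TO STDOUT (FORMAT JSON, FORCE_ARRAY)");
    if (PQresultStatus(res) == PGRES_COPY_OUT)
    {
        /* each CopyData message arrives as one buffer */
        while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
        {
            fwrite(buf, 1, len, stdout);
            PQfreemem(buf);
        }
    }
    PQclear(res);
    PQclear(PQgetResult(conn)); /* consume the final result */
}
8<--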


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..8915fb3 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,49 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 207,220 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 378,396 
   
  
 
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of square brackets as array decorations at the beginning
+   and end of output, and commas between the rows. It is allowed only in
+   COPY TO, and only when using
+   JSON format. The default is
+   false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..23b570f 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 419,424 
--- 419,425 
  	bool		format_specified = false;
  	bool		freeze_specified = false;
  	bool		header_specified = false;
+ 	bool		force_array_specified = false;
  	ListCell   *option;
  
  	/* Support external use for option sanity checking */
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 444,451 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
*** 540,545 
--- 543,555 
  defel->defname),
  		 parser_errposition(pstate, defel->location)));
  		}
+ 		else if (strcmp(defel->defname, "force_array") == 0)
+ 		{
+ 			if (force_array_specified)
+ errorConflictingDefElem(defel, pstate);
+ 			force_array_specified = true;
+ 			opts_out->force_array = defGetBoolean(defel);
+ 		}
  		else if (strcmp(defel->defname, "convert_selectively") == 0)
  		{
  			/*
*** ProcessCopyOptions(ParseState *pstate,
*** 598,603 
--- 608,625 
  (errcode(ERRCODE_SYNTAX_

Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 19:39, David G. Johnston wrote:
On Wed, Dec 6, 2023 at 4:45 PM Joe Conway  wrote:



" The backend sends a CopyOutResponse message to the frontend, followed
     by zero or more CopyData messages (always one per row), followed by
     CopyDone"

probably "always one per row" would be changed to note that json array
format outputs two extra rows for the start/end bracket.


Fair, I was ascribing much more semantic meaning to this than it wants.

I don't see any real requirement, given the lack of semantics, to 
mention JSON at all.  It is one CopyData per row, regardless of the 
contents.  We don't delineate between the header and non-header data in 
CSV.  It isn't a protocol concern.


good point

But I still cannot shake the belief that using a format code of 1 - 
which really could be interpreted as meaning "textual csv" in practice - 
for this JSON output is unwise and we should introduce a new integer 
value for the new fundamental output format.


No, I am pretty sure you still have that wrong. The "1" means binary 
mode. As in

8<--
FORMAT

Selects the data format to be read or written: text, csv (Comma 
Separated Values), or binary. The default is text.

8<--

That is completely separate from text and csv. It literally means to use 
the binary output functions instead of the usual ones:


8<--
if (cstate->opts.binary)
getTypeBinaryOutputInfo(attr->atttypid,
_func_oid,
);
else
getTypeOutputInfo(attr->atttypid,
  _func_oid,
  );
8<--

Both "text" and "csv" modes are non-binary output formats. I believe 
the JSON output format is also non-binary.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 18:38, David G. Johnston wrote:
On Wed, Dec 6, 2023 at 4:28 PM David G. Johnston  wrote:


On Wed, Dec 6, 2023 at 4:09 PM Joe Conway  wrote:

On 12/6/23 14:47, Joe Conway wrote:
 > On 12/6/23 13:59, Daniel Verite wrote:
 >>      Andrew Dunstan wrote:
 >>
 >>> IMNSHO, we should produce either a single JSON
 >>> document (the ARRAY case) or a series of JSON documents,
one per row
 >>> (the LINES case).
 >>
 >> "COPY Operations" in the doc says:
 >>
 >> " The backend sends a CopyOutResponse message to the
frontend, followed
 >>     by zero or more CopyData messages (always one per row),
followed by
 >>     CopyDone".
 >>
 >> In the ARRAY case, the first messages with the copyjsontest
 >> regression test look like this (tshark output):
 >>
 >> PostgreSQL
 >>      Type: CopyOut response
 >>      Length: 13
 >>      Format: Text (0)
 >>      Columns: 3
 >>      Format: Text (0)

 > Anything receiving this and looking for a json array should
know how to
 > assemble the data correctly despite the extra CopyData messages.

Hmm, maybe the real problem here is that Columns do not equal
"3" for
the json mode case -- that should really say "1" I think,
because the
row is not represented as 3 columns but rather 1 json object.

Does that sound correct?

Assuming yes, there may still be an issue that there are two more
"rows" than actual output rows (the "[" and the "]"), but maybe those
are less likely to cause some hazard?


What is the limitation, if any, of introducing new type codes for
these.  n = 2..N for the different variants?  Or even -1 for "raw
text"?  And document that columns and structural rows need to be
determined out-of-band.  Continuing to use 1 (text) for this non-csv
data seems like a hack even if we can technically make it function. 
The semantics, especially for the array case, are completely

discarded or wrong.

Also, it seems like this answer would be easier to make if we implement 
COPY FROM now since how is the server supposed to deal with decomposing 
this data into tables without accurate type information?  I don't see 
implementing only half of the feature being a good idea.  I've had much 
more desire for FROM compared to TO personally.


Several people have weighed in on the side of getting COPY TO done by 
itself first. Given how long this discussion has already become for a 
relatively small and simple feature, I am a big fan of not expanding the 
scope now.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 18:28, David G. Johnston wrote:
On Wed, Dec 6, 2023 at 4:09 PM Joe Conway  wrote:


On 12/6/23 14:47, Joe Conway wrote:
 > On 12/6/23 13:59, Daniel Verite wrote:
 >>      Andrew Dunstan wrote:
 >>
 >>> IMNSHO, we should produce either a single JSON
 >>> document (the ARRAY case) or a series of JSON documents, one
per row
 >>> (the LINES case).
 >>
 >> "COPY Operations" in the doc says:
 >>
 >> " The backend sends a CopyOutResponse message to the frontend,
followed
 >>     by zero or more CopyData messages (always one per row),
followed by
 >>     CopyDone".
 >>
 >> In the ARRAY case, the first messages with the copyjsontest
 >> regression test look like this (tshark output):
 >>
 >> PostgreSQL
 >>      Type: CopyOut response
 >>      Length: 13
 >>      Format: Text (0)
 >>      Columns: 3
 >>      Format: Text (0)

 > Anything receiving this and looking for a json array should know
how to
 > assemble the data correctly despite the extra CopyData messages.

Hmm, maybe the real problem here is that Columns do not equal "3" for
the json mode case -- that should really say "1" I think, because the
row is not represented as 3 columns but rather 1 json object.

Does that sound correct?

Assuming yes, there may still be an issue that there are two more
"rows" than actual output rows (the "[" and the "]"), but maybe those
are less likely to cause some hazard?


What is the limitation, if any, of introducing new type codes for 
these.  n = 2..N for the different variants?  Or even -1 for "raw 
text"?  And document that columns and structural rows need to be 
determined out-of-band.  Continuing to use 1 (text) for this non-csv 
data seems like a hack even if we can technically make it function.  The 
semantics, especially for the array case, are completely discarded or wrong.


I am not following you here. SendCopyBegin looks like this currently:

8<
SendCopyBegin(CopyToState cstate)
{
StringInfoData buf;
int natts = list_length(cstate->attnumlist);
int16   format = (cstate->opts.binary ? 1 : 0);
int i;

pq_beginmessage(, PqMsg_CopyOutResponse);
pq_sendbyte(, format);  /* overall format */
pq_sendint16(, natts);
for (i = 0; i < natts; i++)
pq_sendint16(, format); /* per-column formats */
pq_endmessage();
cstate->copy_dest = COPY_FRONTEND;
}
8<

The "1" is saying whether we are in binary mode or not. JSON mode will never 
be sending in binary in the current implementation, at least. And it always 
aggregates all the columns as one json object. So the correct answer is 
(I think):

8<
*** SendCopyBegin(CopyToState cstate)
*** 146,154 

pq_beginmessage(, PqMsg_CopyOutResponse);
pq_sendbyte(, format);  /* overall format */
!   pq_sendint16(, natts);
!   for (i = 0; i < natts; i++)
!   pq_sendint16(, format); /* per-column formats */
pq_endmessage();
cstate->copy_dest = COPY_FRONTEND;
  }
--- 150,169 

pq_beginmessage(, PqMsg_CopyOutResponse);
pq_sendbyte(, format);  /* overall format */
!   if (!cstate->opts.json_mode)
!   {
!   pq_sendint16(, natts);
!   for (i = 0; i < natts; i++)
!   pq_sendint16(, format); /* per-column formats */
!   }
!   else
!   {
!   /*
!* JSON mode is always one non-binary column
!*/
!   pq_sendint16(, 1);
!   pq_sendint16(, 0);
!   }
pq_endmessage();
cstate->copy_dest = COPY_FRONTEND;
  }
8<

That still leaves the need to fix the documentation:

" The backend sends a CopyOutResponse message to the frontend, followed
   by zero or more CopyData messages (always one per row), followed by
   CopyDone"

probably "always one per row" would be changed to note that json array 
format outputs two extra rows for the start/end bracket.


In fact, as written the patch does this:
8<
COPY (SELECT x.i, 'val' || x.i as v FROM
  generate_series(1, 3) x(i) WHERE false)
TO STDOUT WITH (FORMAT JSON, FORCE_ARRAY);
[
]
8<

Not sure if that is a problem or not.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 14:47, Joe Conway wrote:

On 12/6/23 13:59, Daniel Verite wrote:

Andrew Dunstan wrote:

IMNSHO, we should produce either a single JSON 
document (the ARRAY case) or a series of JSON documents, one per row 
(the LINES case).


"COPY Operations" in the doc says:

" The backend sends a CopyOutResponse message to the frontend, followed
by zero or more CopyData messages (always one per row), followed by
CopyDone".

In the ARRAY case, the first messages with the copyjsontest
regression test look like this (tshark output):

PostgreSQL
 Type: CopyOut response
 Length: 13
 Format: Text (0)
 Columns: 3
Format: Text (0)
PostgreSQL
 Type: Copy data
 Length: 6
 Copy data: 5b0a
PostgreSQL
 Type: Copy data
 Length: 76
 Copy data:
207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…

The first Copy data message with contents "5b0a" does not qualify
as a row of data with 3 columns as advertised in the CopyOut
message. Isn't that a problem?



Is it a real problem, or just a bit of documentation change that I missed?

Anything receiving this and looking for a json array should know how to
assemble the data correctly despite the extra CopyData messages.


Hmm, maybe the real problem here is that Columns do not equal "3" for 
the json mode case -- that should really say "1" I think, because the 
row is not represented as 3 columns but rather 1 json object.


Does that sound correct?

Assuming yes, there may still be an issue that there are two more 
"rows" than actual output rows (the "[" and the "]"), but maybe those 
are less likely to cause some hazard?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 16:42, Sehrope Sarkuni wrote:
On Wed, Dec 6, 2023 at 4:29 PM Joe Conway  wrote:


 > 1. Outputting a top level JSON object without the additional column
 > keys. IIUC, the top level keys are always the column names. A
common use
 > case would be a single json/jsonb column that is already formatted
 > exactly as the user would like for output. Rather than enveloping
it in
 > an object with a dedicated key, it would be nice to be able to
output it
 > directly. This would allow non-object results to be outputted as
well
 > (e.g., lines of JSON arrays, numbers, or strings). Due to how
JSON is
 > structured, I think this would play nice with the JSON lines v.s.
array
 > concept.
 >
 > COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM
 > generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON,
 > SOME_OPTION_TO_NOT_ENVELOPE)
 > {"foo":1}
 > {"foo":2}
 > {"foo":3}

Your example does not match what you describe, or do I misunderstand? I
thought your goal was to eliminate the repeated "foo" from each row...


The "foo" in this case is explicit as I'm adding it when building the 
object. What I was trying to show was not adding an additional object 
wrapper / envelope.


So each row is:

{"foo":1}

Rather than:

"{"json_build_object":{"foo":1}}


I am still getting confused ;-)

Let's focus on the current proposed patch with a "minimum required 
feature set".


Right now the default behavior is "JSON lines":
8<---
COPY (SELECT x.i, 'val' || x.i as v FROM
  generate_series(1, 3) x(i))
TO STDOUT WITH (FORMAT JSON);
{"i":1,"v":"val1"}
{"i":2,"v":"val2"}
{"i":3,"v":"val3"}
8<---

and the other, non-default option is "JSON array":
8<---
COPY (SELECT x.i, 'val' || x.i as v FROM
  generate_series(1, 3) x(i))
TO STDOUT WITH (FORMAT JSON, FORCE_ARRAY);
[
 {"i":1,"v":"val1"}
,{"i":2,"v":"val2"}
,{"i":3,"v":"val3"}
]
8<---

So the questions are:
1. Do those two formats work for the initial implementation?
2. Is the default correct or should it be switched
   e.g. rather than specifying FORCE_ARRAY to get an
   array, something like FORCE_NO_ARRAY to get JSON lines
   and the JSON array is default?

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 11:28, Sehrope Sarkuni wrote:

Big +1 to this overall feature.


cool!

Regarding the defaults for the output, I think JSON lines (rather than a 
JSON array of objects) would be preferred. It's more natural to combine 
them and generate that type of data on the fly rather than forcing 
aggregation into a single object.


So that is +2 (Sehrope and me) for the status quo (JSON lines), and +2 
(Andrew and Davin) for defaulting to json arrays. Anyone else want to 
weigh in on that issue?


Couple more features / use cases come to mind as well. Even if they're 
not part of a first round of this feature I think it'd be helpful to 
document them now as it might give some ideas for what does make that 
first cut:


1. Outputting a top level JSON object without the additional column 
keys. IIUC, the top level keys are always the column names. A common use 
case would be a single json/jsonb column that is already formatted 
exactly as the user would like for output. Rather than enveloping it in 
an object with a dedicated key, it would be nice to be able to output it 
directly. This would allow non-object results to be outputted as well 
(e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is 
structured, I think this would play nice with the JSON lines v.s. array 
concept.


COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM 
generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, 
SOME_OPTION_TO_NOT_ENVELOPE)

{"foo":1}
{"foo":2}
{"foo":3}


Your example does not match what you describe, or do I misunderstand? I 
thought your goal was to eliminate the repeated "foo" from each row...


2. An option to ignore null fields so they are excluded from the output. 
This would not be a default but would allow shrinking the total size of 
the output data in many situations. This would be recursive to allow 
nested objects to be shrunk down (not just the top level). This might be 
worthwhile as a standalone JSON function though handling it during 
output would be more efficient as it'd only be read once.


COPY (SELECT json_build_object('foo', CASE WHEN x > 1 THEN x END) FROM 
generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, 
SOME_OPTION_TO_NOT_ENVELOPE, JSON_SKIP_NULLS)

{}
{"foo":2}
{"foo":3}


clear enough I think

3. Reverse of #2 when copying data in to allow defaulting missing fields 
to NULL.


good to record the ask, but applies to a different feature (COPY FROM 
instead of COPY TO).


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 11:44, Nathan Bossart wrote:

On Wed, Dec 06, 2023 at 10:33:49AM -0600, Nathan Bossart wrote:

(format csv)
Time: 12295.480 ms (00:12.295)
Time: 12311.059 ms (00:12.311)
Time: 12305.469 ms (00:12.305)

(format json)
Time: 24568.621 ms (00:24.569)
Time: 23756.234 ms (00:23.756)
Time: 24265.730 ms (00:24.266)


I should also note that the json output is 85% larger than the csv output.


I'll see if I can add some caching to composite_to_json(), but based on 
the relative data size it does not sound like there is much performance 
left on the table to go after, no?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 13:59, Daniel Verite wrote:

Andrew Dunstan wrote:

IMNSHO, we should produce either a single JSON 
document (the ARRAY case) or a series of JSON documents, one per row 
(the LINES case).


"COPY Operations" in the doc says:

" The backend sends a CopyOutResponse message to the frontend, followed
by zero or more CopyData messages (always one per row), followed by
CopyDone".

In the ARRAY case, the first messages with the copyjsontest
regression test look like this (tshark output):

PostgreSQL
 Type: CopyOut response
 Length: 13
 Format: Text (0)
 Columns: 3
Format: Text (0)
PostgreSQL
 Type: Copy data
 Length: 6
 Copy data: 5b0a
PostgreSQL
 Type: Copy data
 Length: 76
 Copy data:
207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…

The first Copy data message with contents "5b0a" does not qualify
as a row of data with 3 columns as advertised in the CopyOut
message. Isn't that a problem?



Is it a real problem, or just a bit of documentation change that I missed?

Anything receiving this and looking for a json array should know how to 
assemble the data correctly despite the extra CopyData messages.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 10:44, Tom Lane wrote:

Joe Conway  writes:
I believe this is ready to commit unless there are further comments or 
objections.


I thought we were still mostly at proof-of-concept stage?


The concept is narrowly scoped enough that I think we are homing in on 
the final patch.



In particular, has anyone done any performance testing?
I'm concerned about that because composite_to_json() has
zero capability to cache any metadata across calls, meaning
there is going to be a large amount of duplicated work
per row.


I will devise some kind of test and report back. I suppose something 
with many rows and many narrow columns comparing time to COPY 
text/csv/json modes would do the trick?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 10:32, Andrew Dunstan wrote:


On 2023-12-06 We 08:49, Joe Conway wrote:

On 12/6/23 07:36, Andrew Dunstan wrote:


On 2023-12-05 Tu 16:46, Joe Conway wrote:

On 12/5/23 16:20, Andrew Dunstan wrote:

On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of 
MSSQL server I think [1] which had the non-array with commas 
import and export style. It was not that tough to support and the 
code as written already does it, so why not?


That seems quite absurd, TBH. I know we've catered for some 
absurdity in
the CSV code (much of it down to me), so maybe we need to be 
liberal in
what we accept here too. IMNSHO, we should produce either a single 
JSON

document (the ARRAY case) or a series of JSON documents, one per row
(the LINES case).


So your preference would be to not allow the non-array-with-commas 
case but if/when we implement COPY FROM we would accept that format? 
As in Postel's law ("be conservative in what you do, be liberal in 
what you accept from others")?



Yes, I think so.


Awesome. The attached does it that way. I also ran pgindent.

I believe this is ready to commit unless there are further comments or 
objections.


Sorry to bikeshed a little more, I'm a bit late looking at this.

I suspect that most users will actually want the table as a single JSON
document, so it should probably be the default. In any case FORCE_ARRAY
as an option has a slightly wrong feel to it.


Sure, I can make that happen, although I figured that for the 
many-rows-scenario the single array size might be an issue for whatever 
you are importing into.



I'm having trouble coming up with a good name for the reverse of
that, off the top of my head.


Will think about it and propose something with the next patch revision.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 07:36, Andrew Dunstan wrote:


On 2023-12-05 Tu 16:46, Joe Conway wrote:

On 12/5/23 16:20, Andrew Dunstan wrote:

On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and 
export style. It was not that tough to support and the code as 
written already does it, so why not?


That seems quite absurd, TBH. I know we've catered for some absurdity in
the CSV code (much of it down to me), so maybe we need to be liberal in
what we accept here too. IMNSHO, we should produce either a single JSON
document (the ARRAY case) or a series of JSON documents, one per row
(the LINES case).


So your preference would be to not allow the non-array-with-commas 
case but if/when we implement COPY FROM we would accept that format? 
As in Postel's law ("be conservative in what you do, be liberal in 
what you accept from others")?



Yes, I think so.


Awesome. The attached does it that way. I also ran pgindent.

I believe this is ready to commit unless there are further comments or 
objections.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Add json format mode to COPY TO

Add json format mode support to COPY TO, which includes two output
variations: 1) "json lines" which is each row as a json object delimited
by newlines (the default); and 2) "json array" which is the same as #1,
but with the addition of a leading "[", trailing "]", and comma row
delimiters, to form a valid json array.

Early versions: helpful hints/reviews provided by Nathan Bossart,
Tom Lane, and Maciek Sakrejda. Final versions: reviewed by Andrew Dunstan
and Davin Shearer.

Requested-by: Davin Shearer
Author: Joe Conway
Reviewed-by: Andrew Dunstan, Davin Shearer 
Discussion: https://postgr.es/m/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..8915fb3 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,49 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 207,220 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 378,396 
   
  
 
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of square brackets as array decorations at the beginning
+   and end of output, and commas between the rows. It is allowed only in
+   COPY TO, and only when using
+   JSON format. The default is
+   false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..23b570f 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 419,424 
--- 419,425 
  	bool		format_specified = false;
  	bool		freeze_specified = false;
  	bool		header_specified = false;
+ 	bool		force_array_specified = false;
  	ListCell   *option;
  
  	/* Support external use for option sanity checking */
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 444,451 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
*** 540,545 
--- 543,555 
  defel->defname),
  		 parser_errposition(pstate, defel->location)));
  		}
+ 		else if (strcmp(defel->defname, "force_array") == 0)
+ 		{
+ 			if (force_array_specified)
+ errorConflictingDefElem(defel, pstate);
+ 			force_array_specified = true;
+ 			opts_out->force_array = defGetBoolean(defel);
+ 		}
  		else if (strcmp(defel->defname, "convert_selectively") == 0)
  		{
  			/*
*** ProcessCopyOptions(ParseState *pstate,
*** 598,603 
--- 608,625 
  (errcode(ERRCODE_SYNTAX_ERROR),
   errmsg("cannot specify DEFAULT in BINARY mode")));
  
+ 	if (opts_out->

Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 16:20, Andrew Dunstan wrote:

On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and 
export style. It was not that tough to support and the code as written 
already does it, so why not?


That seems quite absurd, TBH. I know we've catered for some absurdity in
the CSV code (much of it down to me), so maybe we need to be liberal in
what we accept here too. IMNSHO, we should produce either a single JSON
document (the ARRAY case) or a series of JSON documents, one per row
(the LINES case).



So your preference would be to not allow the non-array-with-commas case 
but if/when we implement COPY FROM we would accept that format? As in 
Postel's law ("be conservative in what you do, be liberal in what you 
accept from others")?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 16:12, Andrew Dunstan wrote:


On 2023-12-05 Tu 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:


On 2023-12-05 Tu 14:50, Davin Shearer wrote:

Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
case where someone would want to use FORCE ARRAY without also using 
FORCE ROW DELIMITER.  I can, however, envision a use case where 
someone would want FORCE ROW DELIMITER without FORCE ARRAY, like 
maybe including into a larger array.  I definitely appreciate these 
options and the flexibility that they afford from a user perspective.


In the test output, will you also show the different variations with 
FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, 
false), (false, true), (true, true)}? Technically you've already 
shown me the (false, false) case as those are the defaults.





I don't understand the point of FORCE_ROW_DELIMITER at all. There is
only one legal delimiter of array items in JSON, and that's a comma.
There's no alternative and it's not optional. So in the array case you
MUST have commas and in any other case (e.g. LINES) I can't see why you
would have them.


The current patch already *does* imply row delimiters in the array 
case. It says so here:

8<---
+    
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of 
output.

+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and 
only

+   when using JSON format.
+   The default is false.
+  
8<---

and it does so here:
8<---
+ if (opts_out->force_array)
+ opts_out->force_row_delimiter = true;
8<---

and it shows that here:
8<---
+ copy copytest to stdout (format json, force_array);
+ [
+  {"style":"DOS","test":"abc\r\ndef","filler":1}
+ ,{"style":"Unix","test":"abc\ndef","filler":2}
+ ,{"style":"Mac","test":"abc\rdef","filler":3}
+ ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
+ ]
8<---

It also does not allow explicitly setting row delimiters false while 
force_array is true here:

8<---

+ if (opts_out->force_array &&
+ force_row_delimiter_specified &&
+ !opts_out->force_row_delimiter)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+  errmsg("cannot specify FORCE_ROW_DELIMITER 
false with FORCE_ARRAY true")));

8<---

Am I understanding something incorrectly?



But what's the point of having it if you're not using FORCE_ARRAY?



See the follow up email -- other databases support it so why not? It 
seems to be a thing...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and export 
style. It was not that tough to support and the code as written already 
does it, so why not?


[1] 
https://learn.microsoft.com/en-us/sql/relational-databases/json/remove-square-brackets-from-json-without-array-wrapper-option?view=sql-server-ver16#example-multiple-row-result



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 15:55, Andrew Dunstan wrote:


On 2023-12-05 Tu 14:50, Davin Shearer wrote:

Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
case where someone would want to use FORCE ARRAY without also using 
FORCE ROW DELIMITER.  I can, however, envision a use case where 
someone would want FORCE ROW DELIMITER without FORCE ARRAY, like maybe 
including into a larger array.  I definitely appreciate these options 
and the flexibility that they afford from a user perspective.


In the test output, will you also show the different variations with 
FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false), 
(false, true), (true, true)}?  Technically you've already shown me the 
(false, false) case as those are the defaults.





I don't understand the point of FORCE_ROW_DELIMITER at all. There is
only one legal delimiter of array items in JSON, and that's a comma.
There's no alternative and it's not optional. So in the array case you
MUST have commas and in any other case (e.g. LINES) I can't see why you
would have them.


The current patch already *does* imply row delimiters in the array case. 
It says so here:

8<---
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of 
output.

+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
8<---

and it does so here:
8<---
+ if (opts_out->force_array)
+ opts_out->force_row_delimiter = true;
8<---

and it shows that here:
8<---
+ copy copytest to stdout (format json, force_array);
+ [
+  {"style":"DOS","test":"abc\r\ndef","filler":1}
+ ,{"style":"Unix","test":"abc\ndef","filler":2}
+ ,{"style":"Mac","test":"abc\rdef","filler":3}
+ ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
+ ]
8<---

It also does not allow explicitly setting row delimiters false while 
force_array is true here:

8<---

+   if (opts_out->force_array &&
+   force_row_delimiter_specified &&
+   !opts_out->force_row_delimiter)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 	 errmsg("cannot specify FORCE_ROW_DELIMITER false with 
FORCE_ARRAY true")));

8<---

Am I understanding something incorrectly?

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 12:43, Davin Shearer wrote:

Joe, those test cases look great and the outputs are the same as `jq`.




Forward slash escaping is optional, so not escaping them in Postgres is 
okay. The important thing is that the software _reading_ JSON 
interprets both '\/' and '/' as '/'.


Thanks for the review and info. I modified the existing regression test 
thus:


8<--
create temp table copyjsontest (
id bigserial,
f1 text,
f2 timestamptz);

insert into copyjsontest
  select g.i,
 CASE WHEN g.i % 2 = 0 THEN
   'line with '' in it: ' || g.i::text
 ELSE
   'line with " in it: ' || g.i::text
 END,
 'Mon Feb 10 17:32:01 1997 PST'
  from generate_series(1,5) as g(i);

insert into copyjsontest (f1) values
(E'aaa\"bbb'::text),
(E'aaa\\bbb'::text),
(E'aaa\/bbb'::text),
(E'aaa\bbbb'::text),
(E'aaa\fbbb'::text),
(E'aaa\nbbb'::text),
(E'aaa\rbbb'::text),
(E'aaa\tbbb'::text);
copy copyjsontest to stdout json;
{"id":1,"f1":"line with \" in it: 1","f2":"1997-02-10T20:32:01-05:00"}
{"id":2,"f1":"line with ' in it: 2","f2":"1997-02-10T20:32:01-05:00"}
{"id":3,"f1":"line with \" in it: 3","f2":"1997-02-10T20:32:01-05:00"}
{"id":4,"f1":"line with ' in it: 4","f2":"1997-02-10T20:32:01-05:00"}
{"id":5,"f1":"line with \" in it: 5","f2":"1997-02-10T20:32:01-05:00"}
{"id":1,"f1":"aaa\"bbb","f2":null}
{"id":2,"f1":"aaa\\bbb","f2":null}
{"id":3,"f1":"aaa/bbb","f2":null}
{"id":4,"f1":"aaa\","f2":null}
{"id":5,"f1":"aaa\fbbb","f2":null}
{"id":6,"f1":"aaa\nbbb","f2":null}
{"id":7,"f1":"aaa\rbbb","f2":null}
{"id":8,"f1":"aaa\tbbb","f2":null}
8<--

I think the code, documentation, and tests are in pretty good shape at 
this point. Latest version attached.


Any other comments or complaints out there?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Add json format mode to COPY TO

Add json format mode support to COPY TO, which includes three output
variations: 1) "json lines" which is each row as a json object delimited
by newlines (the default); 2) "json lines", except include comma delimiters
between json objects; and 3) "json array" which is the same as #2, but with
the addition of a leading "[" and trailing "]" to form a valid json array.

Early versions: helpful hints/reviews provided by Nathan Bossart,
Tom Lane, and Maciek Sakrejda. Final versions: reviewed by Andrew Dunstan
and Davin Shearer.

Requested-by: Davin Shearer
Author: Joe Conway
Reviewed-by: Andrew Dunstan, Davin Shearer 
Discussion: https://postgr.es/m/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5


diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..af8777b 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,50 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
+ FORCE_ROW_DELIMITER [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 208,221 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 379,410 
   
  
 
+ 
+
+ FORCE_ROW_DELIMITER
+ 
+  
+   Force output of commas as row delimiters, in addition to the usual
+   end of line characters. This option is allowed only in
+   COPY TO, and only when using
+   JSON format.
+   The default is false.
+  
+ 
+
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of output.
+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c

Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/4/23 21:54, Joe Conway wrote:

On 12/4/23 17:55, Davin Shearer wrote:

There are however a few characters that need to be escaped



 1. |"|(double quote)
 2. |\|(backslash)
 3. |/|(forward slash)
 4. |\b|(backspace)
 5. |\f|(form feed)
 6. |\n|(new line)
 7. |\r|(carriage return)
 8. |\t|(horizontal tab)

These characters should be represented in the test cases to see how the 
escaping behaves and to ensure that the escaping is done properly per 
JSON requirements.


I can look at adding these as test cases.

So I did a quick check:
8<--
with t(f1) as
(
  values
(E'aaa\"bbb'::text),
(E'aaa\\bbb'::text),
(E'aaa\/bbb'::text),
(E'aaa\bbbb'::text),
(E'aaa\fbbb'::text),
(E'aaa\nbbb'::text),
(E'aaa\rbbb'::text),
(E'aaa\tbbb'::text)
)
select
  length(t.f1),
  t.f1,
  row_to_json(t)
from t;
 length | f1  |row_to_json
+-+---
  7 | aaa"bbb | {"f1":"aaa\"bbb"}
  7 | aaa\bbb | {"f1":"aaa\\bbb"}
  7 | aaa/bbb | {"f1":"aaa/bbb"}
  7 | aaa\x08bbb  | {"f1":"aaa\bbbb"}
  7 | aaa\x0Cbbb  | {"f1":"aaa\fbbb"}
  7 | aaa+| {"f1":"aaa\nbbb"}
| bbb |
  7 | aaa\rbbb| {"f1":"aaa\rbbb"}
  7 | aaa bbb | {"f1":"aaa\tbbb"}
(8 rows)

8<--

This is all independent of my patch for COPY TO. If I am reading that 
correctly, everything matches Davin's table *except* the forward slash 
("/"). I defer to the experts on the thread to debate that...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: introduce dynamic shared memory registry

2023-12-05 Thread Joe Conway

On 12/4/23 22:46, Nathan Bossart wrote:

Every once in a while, I find myself wanting to use shared memory in a
loadable module without requiring it to be loaded at server start via
shared_preload_libraries.  The DSM API offers a nice way to create and
manage dynamic shared memory segments, so creating a segment after server
start is easy enough.  However, AFAICT there's no easy way to teach other
backends about the segment without storing the handles in shared memory,
which puts us right back at square one.

The attached 0001 introduces a "DSM registry" to solve this problem.  The
API provides an easy way to allocate/initialize a segment or to attach to
an existing one.  The registry itself is just a dshash table that stores
the handles keyed by a module-specified string.  0002 adds a test for the
registry that demonstrates basic usage.

I don't presently have any concrete plans to use this for anything, but I
thought it might be useful for extensions for caching, etc. and wanted to
see whether there was any interest in the feature.


Notwithstanding any dragons there may be, and not having actually looked 
at the patches, I love the concept! +


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Joe Conway

On 12/4/23 17:55, Davin Shearer wrote:
Sorry about the top posting / top quoting... the link you sent me gives 
me a 404.  I'm not exactly sure what top quoting / posting means and 
Googling those terms wasn't helpful for me, but I've removed the quoting 
that my mail client is automatically "helpfully" adding to my emails.  I 
mean no offense.


No offense taken. But it is worthwhile to conform to the very long 
established norms of the mailing lists on which you participate. See:


  https://en.wikipedia.org/wiki/Posting_style

I would describe the Postgres list style (based on that link) as

   "inline replying, in which the different parts of the reply follow
the relevant parts of the original post...[with]...trimming of the
original text"


There are however a few characters that need to be escaped



 1. |"|(double quote)
 2. |\|(backslash)
 3. |/|(forward slash)
 4. |\b|(backspace)
 5. |\f|(form feed)
 6. |\n|(new line)
 7. |\r|(carriage return)
 8. |\t|(horizontal tab)

These characters should be represented in the test cases to see how the 
escaping behaves and to ensure that the escaping is done properly per 
JSON requirements.


I can look at adding these as test cases. The latest version of the 
patch (attached) includes some of that already. For reference, the tests 
so far include this:


8<---
test=# select * from copytest;
  style  |   test   | filler
-+--+
 DOS | abc\r   +|  1
 | def  |
 Unix| abc +|  2
 | def  |
 Mac | abc\rdef |  3
 esc\ape | a\r\\r\ +|  4
 | \nb  |
(4 rows)

test=# copy copytest to stdout (format json);
{"style":"DOS","test":"abc\r\ndef","filler":1}
{"style":"Unix","test":"abc\ndef","filler":2}
{"style":"Mac","test":"abc\rdef","filler":3}
{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
8<---

At this point "COPY TO" should be sending exactly the unaltered output 
of the postgres JSON processing functions.
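
A quick way to cross-check, since it is the same machinery underneath -- the
per-row objects should be identical to what the existing function emits:

8<---
select row_to_json(copytest) from copytest;
8<---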


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..af8777b 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,50 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
+ FORCE_ROW_DELIMITER [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 208,221 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 379,410 
   
  
 
+ 
+
+ FORCE_ROW_DELIMITER
+ 
+  
+   Force output of commas as row delimiters, in addition to the usual
+   end of line characters. This option is allowed only in
+   COPY TO, and only when using
+   JSON format.
+   The default is false.
+  
+ 
+
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of output.
+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..0236a9e 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 419,424 
--- 419,426 
  	bool		format_specified = false;
  	bool		freeze_specified = false;
  	bool		header_specified = false;
+ 	bool		force_row_delimiter_specified = false;
+ 	bool		force_array_specified = false;
  	ListCell   *option;
  
  	/* Support external use for option sanity checking */
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 445,452 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
*** 540,545 
--- 544,563 
  defel->de

Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Joe Conway

On 12/4/23 09:25, Andrew Dunstan wrote:


On 2023-12-04 Mo 08:37, Joe Conway wrote:

On 12/4/23 07:41, Andrew Dunstan wrote:


On 2023-12-03 Su 20:14, Joe Conway wrote:

(please don't top quote on the Postgres lists)

On 12/3/23 17:38, Davin Shearer wrote:
" being quoted as \\" breaks the JSON. It needs to be \".  This has 
been my whole problem with COPY TO for JSON.


Please validate that the output is in proper format with correct 
quoting for special characters. I use `jq` on the command line to 
validate and format the output.


I just hooked existing "row-to-json machinery" up to the "COPY TO" 
statement. If the output is wrong (just for this use case?), 
that would be a missing feature (or possibly a bug?).


Davin -- how did you work around the issue with the way the built in 
functions output JSON?


Andrew -- comments/thoughts?


I meant to mention this when I was making comments yesterday.

The patch should not be using CopyAttributeOutText - it will try to
escape characters such as \, which produces the effect complained of
here, or else we need to change its setup so we have a way to inhibit
that escaping.



Interesting.

I am surprised this has never been raised as a problem with COPY TO 
before.


Should the JSON output, as produced by composite_to_json(), be sent 
as-is with no escaping at all? If yes, is JSON somehow unique in this 
regard?



Text mode output is in such a form that it can be read back in using
text mode input. There's nothing special about JSON in this respect -
any text field will be escaped too. But output suitable for text mode
input is not what you're trying to produce here; you're trying to
produce valid JSON.

So, yes, the result of composite_to_json, which is already suitably
escaped, should not be further escaped in this case.


Gotcha.

This patch version uses CopySendData() instead and includes 
documentation changes. Still lacks regression tests.


Hopefully this looks better. Any other particular strings I ought to 
test with?


8<--
test=# copy (select * from foo limit 4) to stdout (format json, 
force_array true);

[
 {"id":1,"f1":"line with \" in it: 
1","f2":"2023-12-03T12:26:41.596053-05:00"}
,{"id":2,"f1":"line with ' in it: 
2","f2":"2023-12-03T12:26:41.596173-05:00"}
,{"id":3,"f1":"line with \" in it: 
3","f2":"2023-12-03T12:26:41.596179-05:00"}
,{"id":4,"f1":"line with ' in it: 
4","f2":"2023-12-03T12:26:41.596182-05:00"}

]
8<--

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..af8777b 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,50 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
+ FORCE_ROW_DELIMITER [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 208,221 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 379,410 
   
  
 
+ 
+
+ FORCE_ROW_DELIMITER
+ 
+  
+   Force output of commas as row delimiters, in addition to the usual
+   end of line characters. This option is allowed only in
+   COPY TO, and only when using
+   JSON format.
+   The default is false.
+  
+ 
+
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of output.
+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..46ec34f 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 443,450 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
***

Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Joe Conway

On 12/4/23 07:41, Andrew Dunstan wrote:


On 2023-12-03 Su 20:14, Joe Conway wrote:

(please don't top quote on the Postgres lists)

On 12/3/23 17:38, Davin Shearer wrote:
" being quoted as \\" breaks the JSON. It needs to be \".  This has 
been my whole problem with COPY TO for JSON.


Please validate that the output is in proper format with correct 
quoting for special characters. I use `jq` on the command line to 
validate and format the output.


I just hooked existing "row-to-json machinery" up to the "COPY TO" 
statement. If the output is wrong (just for this use case?), that 
would be a missing feature (or possibly a bug?).


Davin -- how did you work around the issue with the way the built in 
functions output JSON?


Andrew -- comments/thoughts?


I meant to mention this when I was making comments yesterday.

The patch should not be using CopyAttributeOutText - it will try to
escape characters such as \, which produces the effect complained of
here, or else we need to change its setup so we have a way to inhibit
that escaping.



Interesting.

I am surprised this has never been raised as a problem with COPY TO before.

Should the JSON output, as produced by composite_to_json(), be sent 
as-is with no escaping at all? If yes, is JSON somehow unique in this 
regard?
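
To make the question concrete (plain SQL, nothing patch-specific; the alias
and value are made up):

8<---
-- row_to_json() already escapes the embedded double quote as \"
select row_to_json(t) from (values ('has " quote')) as t(f1);
--   {"f1":"has \" quote"}
-- pushing that string through the text-mode escaping again turns \" into
-- \\", which is exactly the broken output reported upthread
8<---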


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

(please don't top quote on the Postgres lists)

On 12/3/23 17:38, Davin Shearer wrote:
" being quoted as \\" breaks the JSON. It needs to be \".  This has been 
my whole problem with COPY TO for JSON.


Please validate that the output is in proper format with correct quoting 
for special characters. I use `jq` on the command line to validate and 
format the output.


I just hooked existing "row-to-json machinery" up to the "COPY TO" 
statement. If the output is wrong (just for this use case?), that 
would be a missing feature (or possibly a bug?).


Davin -- how did you work around the issue with the way the built in 
functions output JSON?


Andrew -- comments/thoughts?

Joe


On Sun, Dec 3, 2023, 10:51 Joe Conway <m...@joeconway.com> wrote:


On 12/3/23 10:31, Davin Shearer wrote:
 > Please be sure to include single and double quotes in the test
values
 > since that was the original problem (double quoting in COPY TO
breaking
 > the JSON syntax).

test=# copy (select * from foo limit 4) to stdout (format json);
{"id":2456092,"f1":"line with ' in it:
2456092","f2":"2023-12-03T10:44:40.9712-05:00"}
{"id":2456093,"f1":"line with \\" in it:
2456093","f2":"2023-12-03T10:44:40.971221-05:00"}
{"id":2456094,"f1":"line with ' in it:
2456094","f2":"2023-12-03T10:44:40.971225-05:00"}
{"id":2456095,"f1":"line with \\" in it:
2456095","f2":"2023-12-03T10:44:40.971228-05:00"}

-- 
Joe Conway

PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 14:52, Andrew Dunstan wrote:


On 2023-12-03 Su 14:24, Joe Conway wrote:

On 12/3/23 11:03, Joe Conway wrote:

On 12/3/23 10:10, Andrew Dunstan wrote:
I  realize this is just a POC, but I'd prefer to see 
composite_to_json()

not exposed. You could use the already public datum_to_json() instead,
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
arguments.


Ok, thanks, will do


Just FYI, this change does lose some performance in my not massively 
scientific A/B/A test:


8<---



8<---

That is somewhere in the 3% range.


I assume it's because datum_to_json() constructs a text value from which
you then need to extract the cstring, whereas composite_to_json() just
gives you back the stringinfo. I guess that's a good enough reason to go
with exposing composite_to_json().


Yeah, that was why I went that route in the first place. If you are good 
with it I will go back to that. The code is a bit simpler too.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 11:03, Joe Conway wrote:

On 12/3/23 10:10, Andrew Dunstan wrote:

I  realize this is just a POC, but I'd prefer to see composite_to_json()
not exposed. You could use the already public datum_to_json() instead,
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
arguments.


Ok, thanks, will do


Just FYI, this change does lose some performance in my not massively 
scientific A/B/A test:


8<---
-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 37196.898 ms (00:37.197)
Time: 37408.161 ms (00:37.408)
Time: 38393.309 ms (00:38.393)
Time: 36855.438 ms (00:36.855)
Time: 37806.280 ms (00:37.806)

Avg = 37532

-- original patch
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 37426.207 ms (00:37.426)
Time: 36068.187 ms (00:36.068)
Time: 38285.252 ms (00:38.285)
Time: 36971.042 ms (00:36.971)
Time: 35690.822 ms (00:35.691)

Avg = 36888

-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 39083.467 ms (00:39.083)
Time: 37249.326 ms (00:37.249)
Time: 38529.721 ms (00:38.530)
Time: 38704.920 ms (00:38.705)
Time: 39001.326 ms (00:39.001)

Avg = 38513
8<---

That is somewhere in the 3% range.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 11:03, Joe Conway wrote:

  From your earlier post, regarding constructing the aggregate -- not
extensive testing but one data point:
8<--
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 36353.153 ms (00:36.353)
test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
COPY 1
Time: 46835.238 ms (00:46.835)
8<--


Also if the table is large enough, the aggregate method is not even 
feasible whereas the COPY TO method works:

8<--
test=# select count(*) from foo;
  count
--
 2000
(1 row)

test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 1073741822 bytes by 1 
more bytes.


test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 2000
8<----------

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 10:10, Andrew Dunstan wrote:


On 2023-12-01 Fr 14:28, Joe Conway wrote:

On 11/29/23 10:32, Davin Shearer wrote:

Thanks for the responses everyone.

I worked around the issue using the `psql -tc` method as Filip 
described.


I think it would be great to support writing JSON using COPY TO at 
some point so I can emit JSON to files using a PostgreSQL function 
directly.


-Davin

On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák <fi...@sedlakovi.org> wrote:


    This would be a very special case for COPY. It applies only to a 
single

    column of JSON values. The original problem can be solved with psql
    --tuples-only as David wrote earlier.


    $ psql -tc 'select json_agg(row_to_json(t))
               from (select * from public.tbl_json_test) t;'

   [{"id":1,"t_test":"here's a \"string\""}]


    Special-casing any encoding/escaping scheme leads to bugs and harder
    parsing.


(moved to hackers)

I did a quick PoC patch (attached) -- if there is interest and no hard 
objections I would like to get it up to speed for the January commitfest.


Currently the patch lacks documentation and regression test support.

Questions:
--
1. Is supporting JSON array format sufficient, or does it need to 
support some other options? How flexible does the support scheme need 
to be?


2. This only supports COPY TO and we would undoubtedly want to support 
COPY FROM for JSON as well, but is that required from the start?


Thanks for any feedback.


I  realize this is just a POC, but I'd prefer to see composite_to_json()
not exposed. You could use the already public datum_to_json() instead,
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
arguments.


Ok, thanks, will do


I think JSON array format is sufficient.


The other formats make sense from a completeness standpoint (versus 
other databases) and the latest patch already includes them, so I still 
lean toward supporting all three formats.



I can see both sides of the COPY FROM argument, but I think insisting on
that makes this less doable for release 17. On balance I would stick to
COPY TO for now.


WFM.

From your earlier post, regarding constructing the aggregate -- not 
extensive testing but one data point:

8<--
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 36353.153 ms (00:36.353)
test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
COPY 1
Time: 46835.238 ms (00:46.835)
8<--


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 10:31, Davin Shearer wrote:
Please be sure to include single and double quotes in the test values 
since that was the original problem (double quoting in COPY TO breaking 
the JSON syntax).


test=# copy (select * from foo limit 4) to stdout (format json);
{"id":2456092,"f1":"line with ' in it: 
2456092","f2":"2023-12-03T10:44:40.9712-05:00"}
{"id":2456093,"f1":"line with \\" in it: 
2456093","f2":"2023-12-03T10:44:40.971221-05:00"}
{"id":2456094,"f1":"line with ' in it: 
2456094","f2":"2023-12-03T10:44:40.971225-05:00"}
{"id":2456095,"f1":"line with \\" in it: 
2456095","f2":"2023-12-03T10:44:40.971228-05:00"}


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/2/23 17:37, Joe Conway wrote:

On 12/2/23 16:53, Nathan Bossart wrote:

On Sat, Dec 02, 2023 at 10:11:20AM -0500, Tom Lane wrote:

So if you are writing a production that might need to match
FORMAT followed by JSON, you need to match FORMAT_LA too.


Thanks for the pointer.  That does seem to be the culprit.

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d631ac89a9..048494dd07 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3490,6 +3490,10 @@ copy_generic_opt_elem:
  {
  $$ = makeDefElem($1, $2, @1);
  }
+| FORMAT_LA copy_generic_opt_arg
+{
+$$ = makeDefElem("format", $2, @1);
+}
  ;
  
  copy_generic_opt_arg:



Yep -- I concluded the same. Thanks Tom!


The attached implements the above repair, as well as adding support for 
array decoration (or not) and/or comma row delimiters when not an array.


This covers the three variations of json import/export formats that I 
have found after light searching (SQL Server and DuckDB).


Still lacks and documentation, tests, and COPY FROM support, but here is 
what it looks like in a nutshell:


8<---
create table foo(id int8, f1 text, f2 timestamptz);
insert into foo
  select g.i,
 'line: ' || g.i::text,
 clock_timestamp()
  from generate_series(1,4) as g(i);

copy foo to stdout (format json);
{"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}

copy foo to stdout (format json, force_array);
[
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]

copy foo to stdout (format json, force_row_delimiter);
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}

copy foo to stdout (force_array);
ERROR:  COPY FORCE_ARRAY requires JSON mode

copy foo to stdout (force_row_delimiter);
ERROR:  COPY FORCE_ROW_DELIMITER requires JSON mode
8<---


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..1f9ac31 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 443,450 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
*** 540,545 
--- 542,559 
  defel->defname),
  		 parser_errposition(pstate, defel->location)));
  		}
+ 		else if (strcmp(defel->defname, "force_row_delimiter") == 0)
+ 		{
+ 			if (opts_out->force_row_delimiter)
+ errorConflictingDefElem(defel, pstate);
+ 			opts_out->force_row_delimiter = true;
+ 		}
+ 		else if (strcmp(defel->defname, "force_array") == 0)
+ 		{
+ 			if (opts_out->force_array)
+ errorConflictingDefElem(defel, pstate);
+ 			opts_out->force_array = true;
+ 		}
  		else if (strcmp(defel->defname, "convert_selectively") == 0)
  		{
  			/*
*** ProcessCopyOptions(ParseState *pstate,
*** 598,603 
--- 612,631 
  (errcode(ERRCODE_SYNTAX_ERROR),
   errmsg("cannot specify DEFAULT in BINARY mode")));
  
+ 	if (opts_out->json_mode)
+ 	{
+ 		if (opts_out->force_array)
+ 			opts_out->force_row_delimiter = true;
+ 	}
+ 	else if (opts_out->force_array)
+ 		ereport(ERROR,
+ (errcode(ERR

Re: Emitting JSON to file using COPY TO

2023-12-02 Thread Joe Conway

On 12/2/23 13:50, Maciek Sakrejda wrote:

On Fri, Dec 1, 2023 at 11:32 AM Joe Conway  wrote:

1. Is supporting JSON array format sufficient, or does it need to
support some other options? How flexible does the support scheme need to be?


"JSON Lines" is a semi-standard format [1] that's basically just
newline-separated JSON values. (In fact, this is what
log_destination=jsonlog gives you for Postgres logs, no?) It might be
worthwhile to support that, too.

[1]: https://jsonlines.org/



Yes, I have seen examples of that associated with other databases (MSSQL 
and DuckDB at least) as well. It probably makes sense to support that 
format too.
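
For reference, that would be one standalone JSON document per row with no
array decoration at all, along these lines (made-up sample, not actual
output from the patch):

8<---
{"id":1,"f1":"line: 1"}
{"id":2,"f1":"line: 2"}
{"id":3,"f1":"line: 3"}
8<---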


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-02 Thread Joe Conway

On 12/2/23 16:53, Nathan Bossart wrote:

On Sat, Dec 02, 2023 at 10:11:20AM -0500, Tom Lane wrote:

So if you are writing a production that might need to match
FORMAT followed by JSON, you need to match FORMAT_LA too.


Thanks for the pointer.  That does seem to be the culprit.

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d631ac89a9..048494dd07 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3490,6 +3490,10 @@ copy_generic_opt_elem:
  {
  $$ = makeDefElem($1, $2, @1);
  }
+| FORMAT_LA copy_generic_opt_arg
+{
+$$ = makeDefElem("format", $2, @1);
+}
  ;
  
  copy_generic_opt_arg:



Yep -- I concluded the same. Thanks Tom!

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-02 Thread Joe Conway

On 12/1/23 18:09, Nathan Bossart wrote:

On Fri, Dec 01, 2023 at 02:28:55PM -0500, Joe Conway wrote:

I did a quick PoC patch (attached) -- if there is interest and no hard
objections I would like to get it up to speed for the January commitfest.


Cool.  I would expect there to be interest, given all the other JSON
support that has been added thus far.


Thanks for the review


I noticed that, with the PoC patch, "json" is the only format that must be
quoted.  Without quotes, I see a syntax error.  I'm assuming there's a
conflict with another json-related rule somewhere in gram.y, but I haven't
tracked down exactly which one is causing it.


It seems to be because 'json' is also a type name ($$ = 
SystemTypeName("json")).


What do you think about using 'json_array' instead? It is more specific 
and accurate, and avoids the need to quote.


test=# copy foo to stdout (format json_array);
[
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]


1. Is supporting JSON array format sufficient, or does it need to support
some other options? How flexible does the support scheme need to be?


I don't presently have a strong opinion on this one.  My instinct would be
start with something simple, though.  I don't think we offer any special
options for log_destination...


WFM


2. This only supports COPY TO and we would undoubtedly want to support COPY
FROM for JSON as well, but is that required from the start?


I would vote for including COPY FROM support from the start.


Check. My thought is to only accept the same format we emit -- i.e. only 
take a json array.



!   if (!cstate->opts.json_mode)


I think it's unfortunate that this further complicates the branching in
CopyOneRowTo(), but after some quick glances at the code, I'm not sure it's
worth refactoring a bunch of stuff to make this nicer.


Yeah that was my conclusion.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-01 Thread Joe Conway

On 12/1/23 22:00, Davin Shearer wrote:
I'm really glad to see this taken up as a possible new feature and will 
definitely use it if it gets released.  I'm impressed with how clean, 
understandable, and approachable the postgres codebase is in general and 
how easy it is to read and understand this patch.


I reviewed the patch (though I didn't build and test the code) and have 
a concern with adding the '[' at the beginning and ']' at the end of the 
json output.  Those are already added by `json_agg` 
(https://www.postgresql.org/docs/current/functions-aggregate.html) as 
you can see in my initial email.  Adding them in the COPY TO may be 
redundant (e.g., [[{"key":"value"...}]]).


With this patch in place you don't use json_agg() at all. See the 
example output (this is real output with the patch applied):


(oops -- I meant to send this with the same email as the patch)

8<-
create table foo(id int8, f1 text, f2 timestamptz);
insert into foo
  select g.i,
 'line: ' || g.i::text,
 clock_timestamp()
  from generate_series(1,4) as g(i);

copy foo to stdout (format 'json');
[
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]
8<-


I think COPY TO makes good sense to support, though COPY FROM maybe not 
so much as JSON isn't necessarily flat and rectangular like CSV.


Yeah -- definitely not as straightforward but possibly we just support 
the array-of-jsonobj-rows as input as well?


For my use-case, I'm emitting JSON files to Apache NiFi for processing, 
and NiFi has superior handling of JSON (via JOLT parsers) versus CSV 
where parsing is generally done with regex.  I want to be able to emit 
JSON using a postgres function and thus COPY TO.


Definitely +1 for COPY TO.

I don't think COPY FROM will work out well unless the JSON is required 
to be flat and rectangular.  I would vote -1 to leave it out due to the 
necessary restrictions making it not generally useful.



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-01 Thread Joe Conway

On 11/29/23 10:32, Davin Shearer wrote:

Thanks for the responses everyone.

I worked around the issue using the `psql -tc` method as Filip described.

I think it would be great to support writing JSON using COPY TO at 
some point so I can emit JSON to files using a PostgreSQL function directly.


-Davin

On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák <fi...@sedlakovi.org> wrote:


This would be a very special case for COPY. It applies only to a single
column of JSON values. The original problem can be solved with psql
--tuples-only as David wrote earlier.


$ psql -tc 'select json_agg(row_to_json(t))
               from (select * from public.tbl_json_test) t;'

   [{"id":1,"t_test":"here's a \"string\""}]


Special-casing any encoding/escaping scheme leads to bugs and harder
parsing.


(moved to hackers)

I did a quick PoC patch (attached) -- if there is interest and no hard 
objections I would like to get it up to speed for the January commitfest.


Currently the patch lacks documentation and regression test support.

Questions:
--
1. Is supporting JSON array format sufficient, or does it need to 
support some other options? How flexible does the support scheme need to be?


2. This only supports COPY TO and we would undoubtedly want to support 
COPY FROM for JSON as well, but is that required from the start?


Thanks for any feedback.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..bc1f684 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 443,450 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
*** 667,672 
--- 669,679 
  (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
   errmsg("cannot specify HEADER in BINARY mode")));
  
+ 	if (opts_out->json_mode && opts_out->header_line)
+ 		ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+  errmsg("cannot specify HEADER in JSON mode")));
+ 
  	/* Check quote */
  	if (!opts_out->csv_mode && opts_out->quote != NULL)
  		ereport(ERROR,
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index c66a047..f6ee771 100644
*** a/src/backend/commands/copyto.c
--- b/src/backend/commands/copyto.c
***
*** 37,42 
--- 37,43 
  #include "rewrite/rewriteHandler.h"
  #include "storage/fd.h"
  #include "tcop/tcopprot.h"
+ #include "utils/json.h"
  #include "utils/lsyscache.h"
  #include "utils/memutils.h"
  #include "utils/partcache.h"
*** typedef struct
*** 112,117 
--- 113,120 
  /* NOTE: there's a copy of this in copyfromparse.c */
  static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0";
  
+ /* need delimiter to start next json array element */
+ static bool json_row_delim_needed = false;
  
  /* non-export function prototypes */
  static void EndCopy(CopyToState cstate);
*** DoCopyTo(CopyToState cstate)
*** 845,850 
--- 848,861 
  
  			CopySendEndOfRow(cstate);
  		}
+ 
+ 		/* if a JSON has been requested send the opening bracket */
+ 		if (cstate->opts.json_mode)
+ 		{
+ 			CopySendChar(cstate, '[');
+ 			CopySendEndOfRow(cstate);
+ 			json_row_delim_needed = false;
+ 		}
  	}
  
  	if (cstate->rel)
*** DoCopyTo(CopyToState cstate)
*** 892,897 
--- 903,915 
  		CopySendEndOfRow(cstate);
  	}
  
+ 	/* if a JSON has been requested send the closing bracket */
+ 	if (cstate->opts.json_mode)
+ 	{
+ 		CopySendChar(cstate, ']');
+ 		CopySendEndOfRow(cstate);
+ 	}
+ 
  	MemoryContextDelete(cstate->rowcontext);
  
  	if (fe_copy)
*** DoCopyTo(CopyToState cstate)
*** 906,916 
  static void
  CopyOneRowTo(CopyToState cstate, TupleTableSlot *slot)
  {
- 	bool		need_delim = false;
- 	FmgrInfo   *out_functions = cstate->out_functions;
  	MemoryContext oldcontext;
- 	ListCell   *cur;
- 	char	   *string;
  
  	MemoryContextReset(cstate->rowcontext);
  	oldcontext = MemoryContextSwitchTo(cstate->rowcontext);
--- 924,930 
*** CopyOneRowTo(CopyToState cstate, TupleTa
*** 921,974 
  		CopySendInt16(cstate, list_length(cstate->attnumlist));
  	}
  
! 	/* Make sure the tuple is fully deconstructed */
! 	slot_getallattrs(slot);
! 
! 	foreach(cur, cstate->attnumlist)
  	{
! 		int			attnum = lfirst_int(cur);
! 		Datum		va

Re: How to solve the problem of one backend process crashing and causing other processes to restart?

2023-11-13 Thread Joe Conway

On 11/13/23 00:53, Laurenz Albe wrote:

On Sun, 2023-11-12 at 21:55 -0500, Tom Lane wrote:

yuansong  writes:
> In PostgreSQL, when a backend process crashes, it can cause other backend
> processes to also require a restart, primarily to ensure data consistency.
> I understand that the correct approach is to analyze and identify the
> cause of the crash and resolve it. However, it is also important to be
> able to handle a backend process crash without affecting the operation of
> other processes, thus minimizing the scope of negative impact and
> improving availability. To achieve this goal, could we mimic the Oracle
> process by introducing a "pmon" process dedicated to rolling back crashed
> process transactions and performing resource cleanup? I wonder if anyone
> has attempted such a strategy or if there have been previous discussions
> on this topic.

The reason we force a database-wide restart is that there's no way to
be certain that the crashed process didn't corrupt anything in shared
memory.  (Even with the forced restart, there's a window where bad
data could reach disk before we kill off the other processes that
might write it.  But at least it's a short window.)  "Corruption"
here doesn't just involve bad data placed into disk buffers; more
often it's things like unreleased locks, which would block other
processes indefinitely.

I seriously doubt that anything like what you're describing
could be made reliable enough to be acceptable.  "Oracle does
it like this" isn't a counter-argument: they have a much different
(and non-extensible) architecture, and they also have an army of
programmers to deal with minutiae like undoing resource acquisition.
Even with that, you'd have to wonder about the number of bugs
existing in such necessarily-poorly-tested code paths.


Yes.
I think that PostgreSQL's approach is superior: rather than investing in
code to mitigate the impact of data corruption caused by a crash, invest
in quality code that doesn't crash in the first place.



While true, this does nothing to prevent OOM kills, which are becoming 
more prevalent as, for example, running Postgres in a container (or 
otherwise) with a cgroup memory limit becomes more popular.


And in any case, there are enterprise use cases that necessarily avoid 
Postgres due to this behavior, which is a shame.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: commitfest app down for repairs

2023-09-30 Thread Joe Conway

On 9/30/23 08:47, Joe Conway wrote:

The commitfest app is down for repairs. We will reply here once it
is back up.

The commitfest app is back up.

We restored from a backup taken one day prior. We will take a look at what 
changed in between, but it might be up to folks to redo some things.


A cooling off period was added to the commitfest app for new community 
accounts, similar to what was done with the wiki a few years ago.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





commitfest app down for repairs

2023-09-30 Thread Joe Conway
The commitfest app is down for repairs. We will reply here once it 
is back up.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-09-25 Thread Joe Conway

On 9/25/23 14:03, Jeff Davis wrote:

On Mon, 2023-09-25 at 12:00 -0400, Joe Conway wrote:

Should there be a way to have a separate "execution" search_path?


I hadn't considered that and I like that idea for a few reasons:

   * a lot of the problem cases are for functions that don't need to
access tables at all, e.g., in an index expression.
   * it avoids annoyances with pg_temp, because that's not searched for
functions/operators anyway
   * perhaps we could force the object search_path to be empty for
IMMUTABLE functions?

I haven't thought it through in detail, but it seems like a promising
approach.



Related to this, it would be useful if you could grant create on schema 
for only non-executable objects. You may want to allow a user to create 
their own tables but not allow them to create their own functions, for 
example. Right now "GRANT CREATE ON SCHEMA foo" gives the grantee the 
ability to create "all the things".
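
A quick sketch of what I mean, with made-up role and schema names; with 
today's grammar the grant is all or nothing:

```
CREATE SCHEMA foo;
CREATE ROLE app_user;
GRANT USAGE, CREATE ON SCHEMA foo TO app_user;

SET ROLE app_user;
-- app_user can now do both of these; there is no way to
-- allow only the first:
CREATE TABLE foo.app_data (id int);
CREATE FUNCTION foo.app_fn() RETURNS int LANGUAGE sql AS 'SELECT 1';
RESET ROLE;
```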


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-09-25 Thread Joe Conway

On 9/25/23 11:30, Robert Haas wrote:

I don't believe that people want to run their functions under a
sanitized search_path that only includes system schemas. That might
work for some people, but I think most people will define functions
that call other functions that they themselves defined, or access
tables that they themselves created. They will therefore need the
search_path to include the schemas in which they created those
objects.
Without diving into all the detailed nuances of this discussion, this 
particular paragraph made me wonder if at least part of the problem here 
is that the same search_path is used to find "things that I want to 
execute" (functions and operators) and "things I want to access" 
(tables, etc).


I think many folks would be well served by only having system schemas in 
the search_path for the former (augmented by explicit schema qualifying 
of one's own functions), but agree that almost no one wants that for the 
latter (needing to schema qualify every table reference).


Should there be a way to have a separate "execution" search_path?
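
The closest thing today is pinning one search_path per function, which 
constrains both kinds of lookup at once; a quick sketch with made-up names:

```
CREATE SCHEMA app;
CREATE TABLE app.accounts (id int, active bool);

-- Pinning the "safe" path restricts function/operator resolution to
-- pg_catalog, but it also forces every table reference in the body
-- to be schema qualified:
CREATE FUNCTION app.active_count() RETURNS bigint
    LANGUAGE sql
    SET search_path = pg_catalog, pg_temp
    AS 'SELECT count(*) FROM app.accounts WHERE active';
```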

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: RFC: pg_stat_logmsg

2023-09-13 Thread Joe Conway

On 7/9/23 14:13, Joe Conway wrote:

On 7/7/23 01:38, Gurjeet Singh wrote:

In this case, the error message stored in pg_stat_logmsg is just '%s'.
The filename and line number columns might help identify the actual
error, but that requires users to read the source code, and even then they
cannot know the actual error message.


I believe that the number of such error messages, the ones with very
little descriptive content, is very low in Postgres code. These kinds
of messages are not prevalent, and must be used when code hits obscure
conditions, like seen in your example above. These are the kinds of
errors that Joe is referring to as "should not happen". In these
cases, even if the error message was descriptive, the user will very
likely have to dive deep into code to find out the real cause.


Agreed. As an example, after running `make installcheck`

8<-
select sum(count) from pg_stat_logmsg;
   sum
--
   6005
(1 row)

select message,sum(count)
from pg_stat_logmsg
where length(message) < 30
and elevel in ('ERROR','FATAL','PANIC')
and message like '%\%s%' escape '\'
group by message
order by length(message);
  message| sum
---+-
   %s| 107
   "%s" is a view|   9
   "%s" is a table   |   4
   %s is a procedure |   1
   invalid size: "%s"|  13
   %s at or near "%s"| 131
   %s at end of input|   3
...
8<-

I only see three message formats there that are generic enough to be of
concern (the first and last two shown -- FWIW I did not see any more of
them as the fmt string gets longer)

So out of 6005 log events, 241 fit this concern.

Perhaps given the small number of message format strings affected, it
would make sense to special case those, but I am not sure it is worth
the effort, at least for version 1.


Attached is an update, this time as a patch against 17devel. It is not 
quite complete, but getting close IMHO.


Changes:

1. Now is integrated into contrib as a "Additional Supplied Extension"

2. Documentation added

3. I had a verbal conversation with Andres and he reminded me that the 
original idea for this was to collect data across fleets of pg hosts so 
we could look for how often "should never happen" errors actually 
happen. As such, we need to think in terms of aggregating the info from 
many hosts, potentially with many localized languages for the messages. 
So I converted the "message" column back to an untranslated string, and 
added a "translated_message" column which is localized.


An alternative approach might be to provide a separate function that 
accepts the message string and returns the translation. Thoughts on that?
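
For concreteness, that alternative would look roughly like this at the SQL 
level (names are only a sketch, not part of the posted patch; the actual 
work would happen in the extension's C code via gettext):

8<-
CREATE FUNCTION pg_stat_logmsg_translate(message text)
RETURNS text
AS 'MODULE_PATHNAME', 'pg_stat_logmsg_translate'
LANGUAGE C STRICT;

-- usage, roughly:
-- SELECT message, pg_stat_logmsg_translate(message), sum(count)
--   FROM pg_stat_logmsg GROUP BY 1, 2;
8<-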


4. In the same vein, I added a pgversion column since the filename and 
line number for the same log message could vary across major or even 
minor releases.


Not done:
-
1. The main thing lacking at this point is a regression test.

2. No special casing for message == "%s". I am still not convinced it is 
worthwhile to do so.


Comments gratefully welcomed.

Thanks,

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/contrib/Makefile b/contrib/Makefile
index da4e231..9144cb7 100644
*** a/contrib/Makefile
--- b/contrib/Makefile
*** SUBDIRS = \
*** 34,39 
--- 34,40 
  		pg_buffercache	\
  		pg_freespacemap \
  		pg_prewarm	\
+ 		pg_stat_logmsg \
  		pg_stat_statements \
  		pg_surgery	\
  		pg_trgm		\
diff --git a/contrib/meson.build b/contrib/meson.build
index 84d4e18..4e8ac39 100644
*** a/contrib/meson.build
--- b/contrib/meson.build
*** subdir('pgcrypto')
*** 44,49 
--- 44,50 
  subdir('pg_freespacemap')
  subdir('pg_prewarm')
  subdir('pgrowlocks')
+ subdir('pg_stat_logmsg')
  subdir('pg_stat_statements')
  subdir('pgstattuple')
  subdir('pg_surgery')
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index ab7e38b..4dfc5dd 100644
*** a/doc/src/sgml/contrib.sgml
--- b/doc/src/sgml/contrib.sgml
*** CREATE EXTENSION extension_
*** 157,162 
--- 157,163 
   
   
   
+  
   
   
   
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 4c63a7e..acede66 100644
*** a/doc/src/sgml/filelist.sgml
--- b/doc/src/sgml/filelist.sgml
***
*** 145,150 
--- 145,151 
  
  
  
+ 
  
  
  
diff --git a/contrib/pg_stat_logmsg/.gitignore b/contrib/pg_stat_logmsg/.gitignore
index ...5dcb3ff .
*** a/contrib/pg_stat_logmsg/.gitignore
--- b/contrib/pg_stat_logmsg/.gitignore
***
*** 0 
--- 1,4 
+ # Generated subdirectories
+ /log/
+ /results/
+ /tmp_check/
diff --git a/contrib/pg_stat_logmsg/LICENSE b/contrib/pg_stat_logmsg/LICENSE
index ...998f814 .
*** a/contrib/pg

Re: Possibility to disable `ALTER SYSTEM`

2023-09-07 Thread Joe Conway

On 9/7/23 15:51, Gabriele Bartolini wrote:
I would like to propose a patch that allows administrators to disable 
`ALTER SYSTEM` via either a run-time option to pass to the Postgres 
server process at startup (e.g. `--disable-alter-system=true`, false by 
default) or a new GUC (or even both), without changing the current 
default method of the server.


Without trying to debate the particulars, a huge +1 for the concept of 
allowing ALTER SYSTEM to be disabled. FWIW I would vote the same for 
control over COPY PROGRAM.
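
For reference, the two statements in question, both of which are wide open 
to superusers today (table name made up):

```
-- persisted into postgresql.auto.conf and applied at the next reload:
ALTER SYSTEM SET log_statement = 'all';

-- runs an arbitrary shell command as the postgres OS user:
COPY audit_log TO PROGRAM 'gzip > /tmp/audit.csv.gz';
```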


Not coincidentally both concepts were built into set_user: 
https://github.com/pgaudit/set_user


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: [17] CREATE SUBSCRIPTION ... SERVER

2023-08-31 Thread Joe Conway

On 8/31/23 12:52, Jeff Davis wrote:

On Thu, 2023-08-31 at 10:59 +0530, Ashutosh Bapat wrote:

The server's FDW has to be postgres_fdw. So we have to handle the
awkward dependency between core and postgres_fdw (an extension).


That sounds more than just "awkward". I can't think of any precedent
for that and it seems to violate the idea of an "extension" entirely.

Can you explain more concretely how we might resolve that?



Maybe move postgres_fdw to be a first-class built-in feature instead of 
an extension?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG

2023-08-27 Thread Joe Conway

On 8/15/23 10:40, Heikki Linnakangas wrote:

On 01/08/2023 16:48, Joe Conway wrote:

Any further comments on the posted patch[1]? I would like to apply/push
this prior to the beta and minor releases next week.


I'm not sure about the placement of the uselocale() calls. In
plperl_spi_exec(), for example, I think we should switch to the global
locale right after the check_spi_usage_allowed() call. Otherwise, if an
error happens in BeginInternalSubTransaction() or in pg_verifymbstr(),
the error would be processed with the perl locale. Maybe that's
harmless, error processing hardly cares about LC_MONETARY, but seems
wrong in principle.


I guess you could probably argue that we should flip this around, and 
only enter the perl locale when calling into libperl, and exit the perl 
locale every time we reemerge under plperl.c control. That seems pretty 
drastic and potentially messy though.



Hmm, come to think of it, if BeginInternalSubTransaction() throws an
error, we just jump out of the perl interpreter? That doesn't seem cool.
But that's not new with this patch.


Hmm, true enough I guess.


If I'm reading correctly, compile_plperl_function() calls
select_perl_context(), which calls plperl_trusted_init(), which calls
uselocale(). So it leaves locale set to the perl locale. Who sets it back?


No one does it seems, at least not currently


How about adding a small wrapper around eval_pl() that sets and unsets
the locale(), just when we enter the interpreter? It's easier to see
that we are doing the calls in right places, if we make them as close as
possible to entering/exiting the interpreter. Are there other functions
in addition to eval_pl() that need to be called with the perl locale?


I can see that as a better strategy, but "other functions in addition to 
eval_pv()" (I assume you mean eval_pv rather than eval_pl) is a tricky 
one to answer.


I ran the attached script like so (from cwd src/pl/plperl) like so:
```
symbols-used.sh /lib/x86_64-linux-gnu/libperl.so.5.34 plperl.so
```
and get a fairly long list of exported libperl functions that get linked 
into plperl.so:


```
Matched symbols:
boot_DynaLoader
perl_alloc
Perl_av_extend
Perl_av_fetch
Perl_av_len
Perl_av_push
*Perl_call_list
*Perl_call_pv
*Perl_call_sv
perl_construct
Perl_croak
Perl_croak_nocontext
Perl_croak_sv
Perl_croak_xs_usage
Perl_die
*Perl_eval_pv
Perl_free_tmps
Perl_get_sv
Perl_gv_add_by_type
Perl_gv_stashpv
Perl_hv_clear
Perl_hv_common
Perl_hv_common_key_len
Perl_hv_iterinit
Perl_hv_iternext
Perl_hv_iternext_flags
Perl_hv_iternextsv
Perl_hv_ksplit
Perl_looks_like_number
Perl_markstack_grow
Perl_mg_get
Perl_newRV
Perl_newRV_noinc
Perl_newSV
Perl_newSViv
Perl_newSVpv
Perl_newSVpvn
Perl_newSVpvn_flags
Perl_newSVsv
Perl_newSVsv_flags
Perl_newSV_type
Perl_newSVuv
Perl_newXS
Perl_newXS_flags
*perl_parse
Perl_pop_scope
Perl_push_scope
*perl_run
Perl_save_item
Perl_savetmps
Perl_stack_grow
Perl_sv_2bool
Perl_sv_2bool_flags
Perl_sv_2iv
Perl_sv_2iv_flags
Perl_sv_2mortal
Perl_sv_2pv
Perl_sv_2pvbyte
Perl_sv_2pvbyte_flags
Perl_sv_2pv_flags
Perl_sv_2pvutf8
Perl_sv_2pvutf8_flags
Perl_sv_bless
Perl_sv_free
Perl_sv_free2
Perl_sv_isa
Perl_sv_newmortal
Perl_sv_setiv
Perl_sv_setiv_mg
Perl_sv_setsv
Perl_sv_setsv_flags
Perl_sys_init
Perl_sys_init3
Perl_xs_boot_epilog
Perl_xs_handshake
```

I marked the ones that look like perhaps we should care about in the 
above list with an asterisk:


*Perl_call_list
*Perl_call_pv
*Perl_call_sv
*Perl_eval_pv
*perl_run

but perhaps there are others?


/*
 * plperl_xact_callback --- cleanup at main-transaction end.
 */
static void
plperl_xact_callback(XactEvent event, void *arg)
{
/* ensure global locale is the current locale */
if (uselocale((locale_t) 0) != LC_GLOBAL_LOCALE)
perl_locale_obj = uselocale(LC_GLOBAL_LOCALE);
}


So the assumption is that if the current locale is not LC_GLOBAL_LOCALE,
then it was the perl locale. Seems true today, but this could cause confusion
if anything else calls uselocale(). In particular, if another PL
implementation copies this, and you use plperl and the other PL at the
same time, they would get mixed up. I think we need another "bool
perl_locale_obj_in_use" variable to track explicitly whether the perl
locale is currently active.


Or perhaps don't assume that we want the global locale and swap between 
pg_locale_obj (whatever it is) and perl_locale_obj?



If we are careful to put the uselocale() calls in the right places so
that we never ereport() while in perl locale, this callback isn't
needed. Maybe it's still a good idea, though, to be extra sure that
things get reset to a sane state if something unexpected happens.


I feel more comfortable that we have a "belt and suspenders" method to 
restore the locale that was in use by Postgres before entering perl.



If multiple interpreters are used, is the single perl_locale_obj
variable still enough? Each interpreter can have its own locale, I believe.


So in other

Re: list of acknowledgments for PG16

2023-08-22 Thread Joe Conway

On 8/22/23 09:44, Tom Lane wrote:

Alvaro Herrera  writes:

On 2023-Aug-22, Peter Eisentraut wrote:

The list of acknowledgments for the PG16 release notes has been committed.
It should show up here sometime: 
<https://www.postgresql.org/docs/16/release-16.html#RELEASE-16-ACKNOWLEDGEMENTS>.



Hmm, I think these docs would only regenerate during the RC1 release, so
it'll be a couple of weeks, unless we manually poke the doc builder.


Yeah.  I could produce a new set of tarballs from the v16 branch tip,
but I don't know the process (nor have the admin permissions) to
extract the HTML docs and put them on the website.



These days the docs update is part of a scripted process for doing an 
entire release.


I'm sure we could figure out how to just release the updated docs, but 
with RC1 a week away, is it really worthwhile?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: C function to return double precision[][]

2023-08-21 Thread Joe Conway

On 8/21/23 15:31, Markur Sens wrote:

Is there any piece of code I could look at to see how to achieve $subject?
I haven’t found anything in the standard library or contrib modules.

I’m trying to build ArrayType ** of sorts and return a Datum of those but I 
can’t seem to manage memory correctly.


There is an example in PL/R here:

https://github.com/postgres-plr/plr/blob/20a1f133bcf2bc8f37ac23da191aea590d612619/pg_conversion.c#L1275

which points to here with number of dims == 2:

https://github.com/postgres-plr/plr/blob/20a1f133bcf2bc8f37ac23da191aea590d612619/pg_conversion.c#L1493

This is all generic to the element type (i.e. not specifically float8), 
but the needed type conversion stuff happens in here:


https://github.com/postgres-plr/plr/blob/20a1f133bcf2bc8f37ac23da191aea590d612619/plr.c#L1109

HTH,

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Would it be possible to backpatch Close support in libpq (28b5726) to PG16?

2023-08-16 Thread Joe Conway

On 8/15/23 15:39, Alvaro Herrera wrote:

On 2023-Aug-16, Michael Paquier wrote:


Personally I think backpatching 28b5726 has a really low risk of
breaking anything.


I agree about the low-risk argument, though.  This is just new code.


Here's a way to think about it.  If 16.1 was already out, would we add
libpq support for Close to 16.2?


Seems pretty clearly a "no" to me.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-08-12 Thread Joe Conway

On 8/12/23 09:15, Joe Conway wrote:

On 8/11/23 22:35, Jeff Davis wrote:

2. We can more accurately serve the user's intent. For instance, the
safe search_path of "pg_catalog, pg_temp" is arcane and seems to be
there just because we don't have a way to specify that pg_temp be
excluded entirely. But perhaps in the future we *do* want to exclude
pg_temp entirely. Knowing that the user just wants "SEARCH SYSTEM"
allows us some freedom to do that.


Personally I think having pg_temp in the SYSTEM search path makes sense
for temp tables, but I find it easy to forget that functions can be
created by unprivileged users in pg_temp, and therefore having pg_temp
in the search path for functions is dangerous.


Hmm, I guess I was too hasty -- seems we have some magic related to this 
already.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-08-12 Thread Joe Conway

On 8/11/23 22:35, Jeff Davis wrote:

The attached patch implements a new SEARCH clause for CREATE FUNCTION.
The SEARCH clause controls the search_path used when executing
functions that were created without a SET clause.

Background:

Controlling search_path is critical for the correctness and security of
functions. Right now, the author of a function without a SET clause has
little ability to control the function's behavior, because even basic
operators like "+" involve search_path. This is a big problem for, e.g.
functions used in expression indexes which are called by any user with
write privileges on the table.

Motivation:

I'd like to (eventually) get to safe-by-default behavior. In other
words, the simplest function declaration should be safe for the most
common use cases.


I agree with the general need.


Add SEARCH { DEFAULT | SYSTEM | SESSION } clause to CREATE/ALTER
function.

   * SEARCH DEFAULT is the same as no SEARCH clause at all, and ends up
stored in the catalog as prosearch='d'.
   * SEARCH SYSTEM means that we switch to the safe search path of
"pg_catalog, pg_temp"  when executing the function. Stored as
prosearch='y'.
   * SEARCH SESSION means that we don't switch the search_path when
executing the function, and it's inherited from the session. Stored as
prosearch='e'.


It isn't clear to me what the precise difference is between DEFAULT and 
SESSION.




2. We can more accurately serve the user's intent. For instance, the
safe search_path of "pg_catalog, pg_temp" is arcane and seems to be
there just because we don't have a way to specify that pg_temp be
excluded entirely. But perhaps in the future we *do* want to exclude
pg_temp entirely. Knowing that the user just wants "SEARCH SYSTEM"
allows us some freedom to do that.


Personally I think having pg_temp in the SYSTEM search path makes sense 
for temp tables, but I find it easy to forget that functions can be 
created by unprivileged users in pg_temp, and therefore having pg_temp 
in the search path for functions is dangerous.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG

2023-08-01 Thread Joe Conway

On 7/3/23 12:25, Tristan Partin wrote:

On Sat Jun 24, 2023 at 8:09 AM CDT, Joe Conway wrote:
Although I have not looked yet, presumably we could have similar 
problems with plpython. I would like to get agreement on this approach 
against plperl before diving into that though.


Thoughts?


I don't see anything immediately wrong with this.


Any further comments on the posted patch[1]? I would like to apply/push 
this prior to the beta and minor releases next week.


Joe

[1] 
https://www.postgresql.org/message-id/ec6fa20d-e691-198a-4a13-e761771b9dec%40joeconway.com


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: pgsql: Fix search_path to a safe value during maintenance operations.

2023-07-31 Thread Joe Conway

On 7/31/23 12:53, Robert Haas wrote:

On Fri, Jun 30, 2023 at 3:41 AM Jeff Davis  wrote:

I'm not sure that everyone in this thread realizes just how broken it
is to depend on search_path in a functional index at all. And doubly so
if it depends on a schema other than pg_catalog in the search_path.

Let's also not forget that logical replication always uses
search_path=pg_catalog, so if you depend on a different search_path for
any function attached to the table (not just functional indexes, also
functions inside expressions or trigger functions), then those are
already broken in version 15. And if a superuser is executing
maintenance commands, there's little reason to think they'll have the
same search path as the user that created the table.

At some point in the very near future (though I realize that point may
come after version 16), we need to lock down the search path in a lot
of cases (not just maintenance commands), and I don't see any way
around that.


I agree. I think there are actually two interrelated problems here.

One is that virtually all code needs to run with the originally
intended search_path rather than some search_path chosen at another
time and maybe by a different user. If not, it's going to break, or
compromise security, depending on the situation. The other is that
running arbitrary code written by somebody else as yourself is
basically instant death, from a security perspective.


I agree too.

But the analysis of the issue needs to go one step further. Even if the 
search_path does not change from the originally intended one, a newly 
created function can shadow the intended one based on argument coercion 
rules.
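
A minimal demonstration with made-up names; the search_path never changes, 
yet the call target does as soon as an exact-match signature appears in 
another schema on the path:

```
CREATE SCHEMA myapp;
SET search_path = myapp, public;

-- the function the index/query author intended to call:
CREATE FUNCTION myapp.fmt_val(numeric) RETURNS text
    LANGUAGE sql AS $$ SELECT 'val: ' || $1 $$;

SELECT fmt_val(42);   -- int is implicitly cast; resolves to myapp.fmt_val(numeric)

-- with the path unchanged, anyone who can create in public can capture
-- the call, because an exact argument-type match wins over schema order:
CREATE FUNCTION public.fmt_val(integer) RETURNS text
    LANGUAGE sql AS $$ SELECT 'oops: ' || $1 $$;

SELECT fmt_val(42);   -- now resolves to public.fmt_val(integer)
```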


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




