Re: Emitting JSON to file using COPY TO

2024-04-01 Thread jian he
On Sat, Mar 9, 2024 at 9:13 AM jian he  wrote:
>
> On Sat, Mar 9, 2024 at 2:03 AM Joe Conway  wrote:
> >
> > On 3/8/24 12:28, Andrey M. Borodin wrote:
> > > Hello everyone!
> > >
> > > Thanks for working on this, really nice feature!
> > >
> > >> On 9 Jan 2024, at 01:40, Joe Conway  wrote:
> > >>
> > >> Thanks -- will have a look
> > >
> > > Joe, recently folks have proposed a lot of patches in this thread that
> > > seem to have diverted from the original way of implementation.
> > > As the author of CF entry [0], can you please comment on which patch
> > > version needs review?
> >
> >
> > I don't know if I agree with the proposed changes, but I have also been
> > waiting to see how the parallel discussion regarding COPY extensibility
> > shakes out.
> >
> > And there were a couple of issues found that need to be tracked down.
> >
> > Additionally I have had time/availability challenges recently.
> >
> > Overall, chances seem slim that this will make it into 17, but I have
> > not quite given up hope yet either.
>
> Hi.
> Here is a summary of the changes I've made in the v9 patches at [0]:
>
> meta: rebased. Now you need to use `git apply` or `git am`; previously,
> for copyto_json.007.diff, you needed to use GNU patch.
>
>
> at [1], Dean Rasheed found some corner cases where the returned slot's
> tts_tupleDescriptor
> from
> `
> ExecutorRun(cstate->queryDesc, ForwardScanDirection, 0, true);
> processed = ((DR_copy *) cstate->queryDesc->dest)->processed;
> `
> cannot be used for composite_to_json.
> Generally, DestReceiver->rStartup is meant to send the TupleDesc to the
> DestReceiver. The COPY TO DestReceiver's rStartup function is
> copy_dest_startup; however, copy_dest_startup is a no-op.
> That means that, to make the final TupleDesc of a COPY TO (FORMAT JSON)
> operation bulletproof,
> we need to copy the tupDesc from CopyToState's queryDesc.
> This only applies when the COPY TO source is a query (example:
> copy (select 1) to stdout), not a table.
> The above is my interpretation.
>

Trying to simplify the explanation:
first, refer to the struct DestReceiver.
For COPY TO (FORMAT JSON), we didn't send the preliminary TupleDesc to the
DestReceiver
via the rStartup function pointer within struct _DestReceiver.

In `CopyOneRowTo(CopyToState cstate, TupleTableSlot *slot)`,
the slot is the final slot returned by query execution,
but we cannot use its tupdesc (slot->tts_tupleDescriptor) to do
composite_to_json,
because the final returned slot's TupleDesc may change during query execution.

So we need to copy the tupDesc from CopyToState's queryDesc.
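
A sketch of the idea (simplified; names follow copyto.c, but this is only an
outline, not the exact patch hunk):

    /*
     * In CopyOneRowTo(), prefer the planned output descriptor over the
     * slot's own descriptor, which can be replaced while the query runs.
     */
    TupleDesc   tupdesc;

    if (cstate->queryDesc != NULL)
        tupdesc = cstate->queryDesc->tupDesc;   /* COPY (query) TO ... */
    else
        tupdesc = slot->tts_tupleDescriptor;    /* COPY table TO ... */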

Also rebased; now the patches apply cleanly.
From 17d9f3765bb74d863e229afed59af7dd923f5379 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Mon, 1 Apr 2024 19:47:40 +0800
Subject: [PATCH v10 1/2] Introduce json format for COPY TO.  The json
 format is only allowed in the COPY TO operation, and cannot be used with the
 header option.

 discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
 discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c...@joeconway.com
---
 doc/src/sgml/ref/copy.sgml |   5 ++
 src/backend/commands/copy.c|  13 +++
 src/backend/commands/copyto.c  | 125 -
 src/backend/parser/gram.y  |   8 ++
 src/backend/utils/adt/json.c   |   5 +-
 src/include/commands/copy.h|   1 +
 src/include/utils/json.h   |   2 +
 src/test/regress/expected/copy.out |  54 +
 src/test/regress/sql/copy.sql  |  38 +
 9 files changed, 208 insertions(+), 43 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 33ce7c4e..add84dbb 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -218,9 +218,14 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
       <literal>text</literal>,
       <literal>csv</literal> (Comma Separated Values),
+      <literal>json</literal> (JavaScript Object Notation),
       or <literal>binary</literal>.
       The default is <literal>text</literal>.
      </para>
+     <para>
+      The <literal>json</literal> option is allowed only in
+      <command>COPY TO</command>.
+     </para>
 

 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f75e1d70..02f16d9e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -497,6 +497,8 @@ ProcessCopyOptions(ParseState *pstate,
  /* default format */ ;
 			else if (strcmp(fmt, "csv") == 0)
 opts_out->csv_mode = true;
+			else if (strcmp(fmt, "json") == 0)
+opts_out->json_mode = true;
 			else if (strcmp(fmt, "binary") == 0)
 opts_out->binary = true;
 			else
@@ -740,6 +742,11 @@ ProcessCopyOptions(ParseState *pstate,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg("cannot specify HEADER in BINARY mode")));
 
+	if (opts_out->json_mode && opts_out->header_line)
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot specify HEADER in JSON mode")));
+
 	/* Check quote */
 	if (!opts_out->csv_mode && opts_out->quote != NULL)
 		ereport(ERROR,
@@ -817,6 +824,12 @@ ProcessCopyOptions(ParseState *pstate,
 

Re: Emitting JSON to file using COPY TO

2024-03-08 Thread jian he
On Sat, Mar 9, 2024 at 2:03 AM Joe Conway  wrote:
>
> On 3/8/24 12:28, Andrey M. Borodin wrote:
> > Hello everyone!
> >
> > Thanks for working on this, really nice feature!
> >
> >> On 9 Jan 2024, at 01:40, Joe Conway  wrote:
> >>
> >> Thanks -- will have a look
> >
> > Joe, recently folks have proposed a lot of patches in this thread that
> > seem to have diverted from the original way of implementation.
> > As the author of CF entry [0], can you please comment on which patch
> > version needs review?
>
>
> I don't know if I agree with the proposed changes, but I have also been
> waiting to see how the parallel discussion regarding COPY extensibility
> shakes out.
>
> And there were a couple of issues found that need to be tracked down.
>
> Additionally I have had time/availability challenges recently.
>
> Overall, chances seem slim that this will make it into 17, but I have
> not quite given up hope yet either.

Hi.
Here is a summary of the changes I've made in the v9 patches at [0]:

meta: rebased. Now you need to use `git apply` or `git am`; previously,
for copyto_json.007.diff, you needed to use GNU patch.


at [1], Dean Rasheed found some corner cases where the returned slot's
tts_tupleDescriptor
from
`
ExecutorRun(cstate->queryDesc, ForwardScanDirection, 0, true);
processed = ((DR_copy *) cstate->queryDesc->dest)->processed;
`
cannot be used for composite_to_json.
Generally, DestReceiver->rStartup is meant to send the TupleDesc to the
DestReceiver. The COPY TO DestReceiver's rStartup function is
copy_dest_startup; however, copy_dest_startup is a no-op.
That means that, to make the final TupleDesc of a COPY TO (FORMAT JSON)
operation bulletproof,
we need to copy the tupDesc from CopyToState's queryDesc.
This only applies when the COPY TO source is a query (example:
copy (select 1) to stdout), not a table.
The above is my interpretation.
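
For reference, this is (paraphrased) what that no-op receiver looks like in
src/backend/commands/copyto.c — the descriptor handed in at executor startup
is simply discarded:

    /* executor startup for COPY TO's DestReceiver */
    static void
    copy_dest_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
    {
        /* no-op */
    }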


at [2], Masahiko Sawada made several points,
mainly to split the patch in two: one for format json, the second for the
force_array option.
Splitting it in two makes it easier to review, I think.
My changes also addressed all the points Masahiko Sawada had mentioned.



[0] 
https://postgr.es/m/cacjufxhd6zrmjjbsdogpovavaekms-u6aorcw0ja-wyi-0k...@mail.gmail.com
[1] 
https://postgr.es/m/CAEZATCWh29787xf=4ngkoixeqrhrqi0qd33z6_-f8t2dz0y...@mail.gmail.com
[2] 
https://postgr.es/m/cad21aocb02zhzm3vxb8hsw8fwosl+irdefb--kmunv8pjpa...@mail.gmail.com




Re: Emitting JSON to file using COPY TO

2024-03-08 Thread Joe Conway

On 3/8/24 12:28, Andrey M. Borodin wrote:
> Hello everyone!
>
> Thanks for working on this, really nice feature!
>
>> On 9 Jan 2024, at 01:40, Joe Conway  wrote:
>>
>> Thanks -- will have a look
>
> Joe, recently folks have proposed a lot of patches in this thread that
> seem to have diverted from the original way of implementation.
> As the author of CF entry [0], can you please comment on which patch
> version needs review?

I don't know if I agree with the proposed changes, but I have also been
waiting to see how the parallel discussion regarding COPY extensibility
shakes out.

And there were a couple of issues found that need to be tracked down.

Additionally I have had time/availability challenges recently.

Overall, chances seem slim that this will make it into 17, but I have
not quite given up hope yet either.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2024-03-08 Thread Andrey M. Borodin
Hello everyone!

Thanks for working on this, really nice feature!

> On 9 Jan 2024, at 01:40, Joe Conway  wrote:
> 
> Thanks -- will have a look

Joe, recently folks have proposed a lot of patches in this thread that
seem to have diverted from the original way of implementation.
As the author of CF entry [0], can you please comment on which patch
version needs review?

Thanks!


Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/47/4716/



Re: Emitting JSON to file using COPY TO

2024-02-18 Thread jian he
On Fri, Jan 19, 2024 at 4:10 PM Masahiko Sawada  wrote:
>
> if (opts_out->json_mode && is_from)
> ereport(ERROR, ...);
>
> if (!opts_out->json_mode && opts_out->force_array)
> ereport(ERROR, ...);
>
> Also these checks can be moved close to other checks at the end of
> ProcessCopyOptions().
>
> ---
> @@ -3395,6 +3395,10 @@ copy_opt_item:
> {
> $$ = makeDefElem("format", (Node *) makeString("csv"), 
> @1);
> }
> +   | JSON
> +   {
> +   $$ = makeDefElem("format", (Node *) makeString("json"), 
> @1);
> +   }
> | HEADER_P
> {
> $$ = makeDefElem("header", (Node *) makeBoolean(true), 
> @1);
> @@ -3427,6 +3431,10 @@ copy_opt_item:
> {
> $$ = makeDefElem("encoding", (Node *) makeString($2), @1);
> }
> +   | FORCE ARRAY
> +   {
> +   $$ = makeDefElem("force_array", (Node *)
> makeBoolean(true), @1);
> +   }
> ;
>
> I believe we don't need to support new options in old-style syntax.
>
You are right about the force_array case;
we don't need to add force_array-related changes in gram.y.


On Wed, Jan 31, 2024 at 9:26 PM Alvaro Herrera  wrote:
>
> On 2024-Jan-23, jian he wrote:
>
> > > +   | FORMAT_LA copy_generic_opt_arg
> > > +   {
> > > +   $$ = makeDefElem("format", $2, @1);
> > > +   }
> > > ;
> > >
> > > I think it's not necessary. "format" option is already handled in
> > > copy_generic_opt_elem.
> >
> > I tested it, and found out this part is necessary,
> > because a query with WITH options like `copy (select 1) to stdout with
> > (format json, force_array false);` will fail.
>
> Right, because "FORMAT JSON" is turned into FORMAT_LA JSON by parser.c
> (see base_yylex there).  I'm not really sure but I think it might be
> better to make it "| FORMAT_LA JSON" instead of invoking the whole
> copy_generic_opt_arg syntax.  Not because of performance, but just
> because it's much clearer what's going on.
>
I am not sure which alternative you are referring to.
I've rebased the patch and made some cosmetic changes.
Now I think it's pretty neat.
You can make your change based on it; then I may understand the
alternative you are referring to.
From b3d3d6023f96aa7971a0663d8c0bd6de50e877a5 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Mon, 19 Feb 2024 10:37:18 +0800
Subject: [PATCH v9 1/2] Add another COPY format: json

This format is only allowed in the COPY TO operation.
discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c34c@joeconway.com
---
 doc/src/sgml/ref/copy.sgml |   5 ++
 src/backend/commands/copy.c|  13 +++
 src/backend/commands/copyto.c  | 125 -
 src/backend/parser/gram.y  |   8 ++
 src/backend/utils/adt/json.c   |   5 +-
 src/include/commands/copy.h|   1 +
 src/include/utils/json.h   |   2 +
 src/test/regress/expected/copy.out |  54 +
 src/test/regress/sql/copy.sql  |  38 +
 9 files changed, 208 insertions(+), 43 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 55764fc1..ef9e4729 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -214,9 +214,14 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
       <literal>text</literal>,
       <literal>csv</literal> (Comma Separated Values),
+      <literal>json</literal> (JavaScript Object Notation),
       or <literal>binary</literal>.
       The default is <literal>text</literal>.
      </para>
+     <para>
+      The <literal>json</literal> option is allowed only in
+      <command>COPY TO</command>.
+     </para>
 

 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cc0786c6..5d5b733d 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -480,6 +480,8 @@ ProcessCopyOptions(ParseState *pstate,
  /* default format */ ;
 			else if (strcmp(fmt, "csv") == 0)
 opts_out->csv_mode = true;
+			else if (strcmp(fmt, "json") == 0)
+opts_out->json_mode = true;
 			else if (strcmp(fmt, "binary") == 0)
 opts_out->binary = true;
 			else
@@ -716,6 +718,11 @@ ProcessCopyOptions(ParseState *pstate,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg("cannot specify HEADER in BINARY mode")));
 
+	if (opts_out->json_mode && opts_out->header_line)
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot specify HEADER in JSON mode")));
+
 	/* Check quote */
 	if (!opts_out->csv_mode && opts_out->quote != NULL)
 		ereport(ERROR,
@@ -793,6 +800,12 @@ ProcessCopyOptions(ParseState *pstate,
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("COPY FREEZE cannot be used with COPY TO")));
 
+	/* Check json format  */
+	if (opts_out->json_mode && is_from)
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot use JSON mode in COPY FROM")));
+
 	if 

Re: Emitting JSON to file using COPY TO

2024-02-02 Thread jian he
On Fri, Feb 2, 2024 at 5:48 PM Alvaro Herrera  wrote:
>
> If you want the server to send this message when the JSON word is not in
> quotes, I'm afraid that's not possible, due to the funny nature of the
> FORMAT keyword when the JSON keyword appears after it.  But why do you
> care?  If you use the patch, then you no longer need to have the "not
> recognized" error messages anymore, because the JSON format is indeed
> a recognized one.
>

"JSON word is not in quotes" is my intention.

Now it seems when people implement any custom format for COPY,
if the format_name is a keyword then we need single quotes.

Thanks for clarifying!




Re: Emitting JSON to file using COPY TO

2024-02-02 Thread Alvaro Herrera
On 2024-Feb-02, jian he wrote:

> copy (select 1) to stdout with  (format json);
> ERROR:  syntax error at or near "format"
> LINE 1: copy (select 1) to stdout with  (format json);
>  ^
> 
> json is a keyword. Is it possible to escape it?
> make `copy (select 1) to stdout with  (format json)` error message the same as
> `copy (select 1) to stdout with  (format json1)`

Sure, you can use 
 copy (select 1) to stdout with  (format "json");
and then you get
ERROR:  COPY format "json" not recognized

Is that what you meant?

If you want the server to send this message when the JSON word is not in
quotes, I'm afraid that's not possible, due to the funny nature of the
FORMAT keyword when the JSON keyword appears after it.  But why do you
care?  If you use the patch, then you no longer need to have the "not
recognized" error messages anymore, because the JSON format is indeed
a recognized one.

Maybe I didn't understand your question.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: Emitting JSON to file using COPY TO

2024-02-02 Thread jian he
On Wed, Jan 31, 2024 at 9:26 PM Alvaro Herrera  wrote:
>
> On 2024-Jan-23, jian he wrote:
>
> > > +   | FORMAT_LA copy_generic_opt_arg
> > > +   {
> > > +   $$ = makeDefElem("format", $2, @1);
> > > +   }
> > > ;
> > >
> > > I think it's not necessary. "format" option is already handled in
> > > copy_generic_opt_elem.
> >
> > I tested it, and found out this part is necessary,
> > because a query with WITH options like `copy (select 1) to stdout with
> > (format json, force_array false);` will fail.
>
> Right, because "FORMAT JSON" is turned into FORMAT_LA JSON by parser.c
> (see base_yylex there).  I'm not really sure but I think it might be
> better to make it "| FORMAT_LA JSON" instead of invoking the whole
> copy_generic_opt_arg syntax.  Not because of performance, but just
> because it's much clearer what's going on.
>

Sorry to bother you.
I haven't applied any patch right now; this is just on master.
I don't know much about gram.y.

copy (select 1)  to stdout with  (format json1);
ERROR:  COPY format "json1" not recognized
LINE 1: copy (select 1)  to stdout with  (format json1);
  ^
copy (select 1) to stdout with  (format json);
ERROR:  syntax error at or near "format"
LINE 1: copy (select 1) to stdout with  (format json);
 ^

json is a keyword. Is it possible to escape it, i.e. to
make the error message for `copy (select 1) to stdout with (format json)`
the same as for `copy (select 1) to stdout with (format json1)`?




Re: Emitting JSON to file using COPY TO

2024-01-31 Thread Alvaro Herrera
On 2024-Jan-23, jian he wrote:

> > +   | FORMAT_LA copy_generic_opt_arg
> > +   {
> > +   $$ = makeDefElem("format", $2, @1);
> > +   }
> > ;
> >
> > I think it's not necessary. "format" option is already handled in
> > copy_generic_opt_elem.
> 
> I tested it, and found out this part is necessary,
> because a query with WITH options like `copy (select 1) to stdout with
> (format json, force_array false);` will fail.

Right, because "FORMAT JSON" is turned into FORMAT_LA JSON by parser.c
(see base_yylex there).  I'm not really sure but I think it might be
better to make it "| FORMAT_LA JSON" instead of invoking the whole
copy_generic_opt_arg syntax.  Not because of performance, but just
because it's much clearer what's going on.
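
For reference, the relevant lookahead in base_yylex
(src/backend/parser/parser.c) is approximately this — paraphrased here,
not an exact copy of the source:

    switch (cur_token)
    {
        case FORMAT:
            /* Replace FORMAT by FORMAT_LA if it's followed by JSON */
            switch (next_token)
            {
                case JSON:
                    cur_token = FORMAT_LA;
                    break;
            }
            break;
        /* ... other lookahead keywords ... */
    }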

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: Emitting JSON to file using COPY TO

2024-01-31 Thread Junwang Zhao
Hi Vignesh,

On Wed, Jan 31, 2024 at 5:50 PM vignesh C  wrote:
>
> On Sat, 27 Jan 2024 at 11:25, Junwang Zhao  wrote:
> >
> > Hi hackers,
> >
> > Kou-san(CCed) has been working on *Make COPY format extendable[1]*, so
> > I think making *copy to json* based on that work might be the right 
> > direction.
> >
> > I wrote an extension for that purpose, and here is the patch set together
> > with Kou-san's *extendable copy format* implementation:
> >
> > 0001-0009 is the implementation of extendable copy format
> > 00010 is the pg_copy_json extension
> >
> > I also created a PR[2] if anybody likes the github review style.
> >
> > The *extendable copy format* feature is still being developed; I am posting
> > this email in case the patch set in this thread is committed without
> > knowledge of the *extendable copy format* feature.
> >
> > I'd like to hear your opinions.
>
> CFBot shows that one of the tests is failing, as in [1]:
> [05:46:41.678] /bin/sh: 1: cannot open
> /tmp/cirrus-ci-build/contrib/pg_copy_json/sql/test_copy_format.sql: No
> such file
> [05:46:41.678] diff:
> /tmp/cirrus-ci-build/contrib/pg_copy_json/expected/test_copy_format.out:
> No such file or directory
> [05:46:41.678] diff:
> /tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out:
> No such file or directory
> [05:46:41.678] # diff command failed with status 512: diff
> "/tmp/cirrus-ci-build/contrib/pg_copy_json/expected/test_copy_format.out"
> "/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out"
> > "/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out.diff"
> [05:46:41.678] Bail out!make[2]: *** [../../src/makefiles/pgxs.mk:454:
> check] Error 2
> [05:46:41.679] make[1]: *** [Makefile:96: check-pg_copy_json-recurse] Error 2
> [05:46:41.679] make: *** [GNUmakefile:71: check-world-contrib-recurse] Error 2
>
> Please post an updated version for the same.

Thanks for the reminder; the patch set I posted is not meant for commit but
for further discussion.

I will post more information about the *extendable copy* feature
when it's about to be committed.

>
> [1] - https://cirrus-ci.com/task/5322439115145216
>
> Regards,
> Vignesh



-- 
Regards
Junwang Zhao




Re: Emitting JSON to file using COPY TO

2024-01-31 Thread vignesh C
On Sat, 27 Jan 2024 at 11:25, Junwang Zhao  wrote:
>
> Hi hackers,
>
> Kou-san(CCed) has been working on *Make COPY format extendable[1]*, so
> I think making *copy to json* based on that work might be the right direction.
>
> I wrote an extension for that purpose, and here is the patch set together
> with Kou-san's *extendable copy format* implementation:
>
> 0001-0009 is the implementation of extendable copy format
> 00010 is the pg_copy_json extension
>
> I also created a PR[2] if anybody likes the github review style.
>
> The *extendable copy format* feature is still being developed; I am posting
> this email in case the patch set in this thread is committed without
> knowledge of the *extendable copy format* feature.
>
> I'd like to hear your opinions.

CFBot shows that one of the tests is failing, as in [1]:
[05:46:41.678] /bin/sh: 1: cannot open
/tmp/cirrus-ci-build/contrib/pg_copy_json/sql/test_copy_format.sql: No
such file
[05:46:41.678] diff:
/tmp/cirrus-ci-build/contrib/pg_copy_json/expected/test_copy_format.out:
No such file or directory
[05:46:41.678] diff:
/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out:
No such file or directory
[05:46:41.678] # diff command failed with status 512: diff
"/tmp/cirrus-ci-build/contrib/pg_copy_json/expected/test_copy_format.out"
"/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out"
> "/tmp/cirrus-ci-build/contrib/pg_copy_json/results/test_copy_format.out.diff"
[05:46:41.678] Bail out!make[2]: *** [../../src/makefiles/pgxs.mk:454:
check] Error 2
[05:46:41.679] make[1]: *** [Makefile:96: check-pg_copy_json-recurse] Error 2
[05:46:41.679] make: *** [GNUmakefile:71: check-world-contrib-recurse] Error 2

Please post an updated version for the same.

[1] - https://cirrus-ci.com/task/5322439115145216

Regards,
Vignesh




Re: Emitting JSON to file using COPY TO

2024-01-22 Thread jian he
On Fri, Jan 19, 2024 at 4:10 PM Masahiko Sawada  wrote:
>
> If I'm not missing something, copyto_json.007.diff is the latest patch, but it
> needs to be rebased to the current HEAD. Here are random comments:
>

Please check the latest version.

>  if (opts_out->json_mode)
> +   {
> +   if (is_from)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +errmsg("cannot use JSON mode in COPY FROM")));
> +   }
> +   else if (opts_out->force_array)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +errmsg("COPY FORCE_ARRAY requires JSON mode")));
>
> I think that flattening these two conditions makes the code more readable:

I made it two condition checks.

> if (opts_out->json_mode && is_from)
> ereport(ERROR, ...);
>
> if (!opts_out->json_mode && opts_out->force_array)
> ereport(ERROR, ...);
>
> Also these checks can be moved close to other checks at the end of
> ProcessCopyOptions().
>
Yes, I did that; please check it.

> @@ -3395,6 +3395,10 @@ copy_opt_item:
> {
> $$ = makeDefElem("format", (Node *) makeString("csv"), 
> @1);
> }
> +   | JSON
> +   {
> +   $$ = makeDefElem("format", (Node *) makeString("json"), 
> @1);
> +   }
> | HEADER_P
> {
> $$ = makeDefElem("header", (Node *) makeBoolean(true), 
> @1);
> @@ -3427,6 +3431,10 @@ copy_opt_item:
> {
> $$ = makeDefElem("encoding", (Node *) makeString($2), @1);
> }
> +   | FORCE ARRAY
> +   {
> +   $$ = makeDefElem("force_array", (Node *)
> makeBoolean(true), @1);
> +   }
> ;
>
> I believe we don't need to support new options in old-style syntax.
>
> ---
> @@ -3469,6 +3477,10 @@ copy_generic_opt_elem:
> {
> $$ = makeDefElem($1, $2, @1);
> }
> +   | FORMAT_LA copy_generic_opt_arg
> +   {
> +   $$ = makeDefElem("format", $2, @1);
> +   }
> ;
>
> I think it's not necessary. "format" option is already handled in
> copy_generic_opt_elem.
>

I tested it, and found out this part is necessary,
because a query with WITH options like `copy (select 1) to stdout with
(format json, force_array false);` will fail.

> ---
> +/* need delimiter to start next json array element */
> +static bool json_row_delim_needed = false;
>
> I think it's cleaner to include json_row_delim_needed into CopyToStateData.
Yes, I agree, so I did that.

> ---
> Splitting the patch into two patches: add json format and add
> force_array option would make reviews easy.
>
Done: one patch for the json format, another for the force_array option.

I also made the following case fail:
copy copytest to stdout (format csv, force_array false);
ERROR:  specify COPY FORCE_ARRAY is only allowed in JSON mode.

If we copy from a table, we call table_scan_getnextslot and there is no
need to worry about the TupleDesc;
however, if we copy a query's output in format json, we may need to consider it.

cstate->queryDesc->tupDesc is the output TupleDesc, and we can rely on it.
For copying a query result to json, I memcpy cstate->queryDesc->tupDesc
to the slot's slot->tts_tupleDescriptor
so that composite_to_json can use cstate->queryDesc->tupDesc to do the work.
I guess this will make it more bullet-proof.
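
In outline, that looks something like this (a sketch only, assuming the two
descriptors have matching attribute counts; the actual patch guards the
single-attribute case):

    /* overwrite the slot's attribute metadata with the planned one */
    memcpy(TupleDescAttr(slot->tts_tupleDescriptor, 0),
           TupleDescAttr(cstate->queryDesc->tupDesc, 0),
           sizeof(FormData_pg_attribute));
    BlessTupleDesc(slot->tts_tupleDescriptor);
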
From 214ad534d13730cba13008798c3d70f8b363436f Mon Sep 17 00:00:00 2001
From: jian he 
Date: Tue, 23 Jan 2024 12:26:43 +0800
Subject: [PATCH v8 2/2] Add force_array for COPY TO json format.

Add opening and closing brackets around the whole output;
separate each json row with a comma after the first row.
---
 doc/src/sgml/ref/copy.sgml | 14 ++
 src/backend/commands/copy.c| 17 +
 src/backend/commands/copyto.c  | 30 ++
 src/backend/parser/gram.y  |  4 
 src/include/commands/copy.h|  1 +
 src/test/regress/expected/copy.out | 24 
 src/test/regress/sql/copy.sql  | 10 ++
 7 files changed, 100 insertions(+)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index ccd90b61..d19332ac 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -43,6 +43,7 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
     FORCE_NOT_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
     FORCE_NULL { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
+    FORCE_ARRAY [ <replaceable class="parameter">boolean</replaceable> ]
     ON_ERROR '<replaceable class="parameter">error_action</replaceable>'
     ENCODING '<replaceable class="parameter">encoding_name</replaceable>'

@@ -379,6 +380,19 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable

+   <varlistentry>
+    <term><literal>FORCE_ARRAY</literal></term>
+    <listitem>
+     <para>
+      Force output of square brackets as array decorations at the beginning
+      and end of output, and commas between the rows. It is allowed only in
+      <command>COPY TO</command>, and only when using
+      <literal>JSON</literal> format. The default is
+      <literal>false</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+
   <varlistentry>
    <term><literal>ON_ERROR</literal></term>

Re: Emitting JSON to file using COPY TO

2024-01-19 Thread Masahiko Sawada
On Thu, Dec 7, 2023 at 10:10 AM Joe Conway  wrote:
>
> On 12/6/23 18:09, Joe Conway wrote:
> > On 12/6/23 14:47, Joe Conway wrote:
> >> On 12/6/23 13:59, Daniel Verite wrote:
> >>> Andrew Dunstan wrote:
> >>>
>  IMNSHO, we should produce either a single JSON
>  document (the ARRAY case) or a series of JSON documents, one per row
>  (the LINES case).
> >>>
> >>> "COPY Operations" in the doc says:
> >>>
> >>> " The backend sends a CopyOutResponse message to the frontend, followed
> >>> by zero or more CopyData messages (always one per row), followed by
> >>> CopyDone".
> >>>
> >>> In the ARRAY case, the first messages with the copyjsontest
> >>> regression test look like this (tshark output):
> >>>
> >>> PostgreSQL
> >>>  Type: CopyOut response
> >>>  Length: 13
> >>>  Format: Text (0)
> >>>  Columns: 3
> >>> Format: Text (0)
> >>> PostgreSQL
> >>>  Type: Copy data
> >>>  Length: 6
> >>>  Copy data: 5b0a
> >>> PostgreSQL
> >>>  Type: Copy data
> >>>  Length: 76
> >>>  Copy data:
> >>> 207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…
> >>>
> >>> The first Copy data message with contents "5b0a" does not qualify
> >>> as a row of data with 3 columns as advertised in the CopyOut
> >>> message. Isn't that a problem?
> >>
> >>
> >> Is it a real problem, or just a bit of documentation change that I missed?
> >>
> >> Anything receiving this and looking for a json array should know how to
> >> assemble the data correctly despite the extra CopyData messages.
> >
> > Hmm, maybe the real problem here is that Columns do not equal "3" for
> > the json mode case -- that should really say "1" I think, because the
> > row is not represented as 3 columns but rather 1 json object.
> >
> > Does that sound correct?
> >
> > Assuming yes, there is still maybe an issue that there are two more
> > "rows" that actual output rows (the "[" and the "]"), but maybe those
> > are less likely to cause some hazard?
>
>
> The attached should fix the CopyOut response to say one column. I.e. it
> ought to look something like:
>
> PostgreSQL
>   Type: CopyOut response
>   Length: 13
>   Format: Text (0)
>   Columns: 1
>   Format: Text (0)
> PostgreSQL
>   Type: Copy data
>   Length: 6
>   Copy data: 5b0a
> PostgreSQL
>   Type: Copy data
>   Length: 76
>   Copy data: [...]
>

If I'm not missing something, copyto_json.007.diff is the latest patch, but it
needs to be rebased to the current HEAD. Here are random comments:

---
 if (opts_out->json_mode)
+   {
+   if (is_from)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("cannot use JSON mode in COPY FROM")));
+   }
+   else if (opts_out->force_array)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("COPY FORCE_ARRAY requires JSON mode")));

I think that flattening these two conditions makes the code more readable:

if (opts_out->json_mode && is_from)
ereport(ERROR, ...);

if (!opts_out->json_mode && opts_out->force_array)
ereport(ERROR, ...);

Also these checks can be moved close to other checks at the end of
ProcessCopyOptions().

---
@@ -3395,6 +3395,10 @@ copy_opt_item:
{
$$ = makeDefElem("format", (Node *) makeString("csv"), @1);
}
+   | JSON
+   {
+   $$ = makeDefElem("format", (Node *) makeString("json"), @1);
+   }
| HEADER_P
{
$$ = makeDefElem("header", (Node *) makeBoolean(true), @1);
@@ -3427,6 +3431,10 @@ copy_opt_item:
{
$$ = makeDefElem("encoding", (Node *) makeString($2), @1);
}
+   | FORCE ARRAY
+   {
+   $$ = makeDefElem("force_array", (Node *)
makeBoolean(true), @1);
+   }
;

I believe we don't need to support new options in old-style syntax.

---
@@ -3469,6 +3477,10 @@ copy_generic_opt_elem:
{
$$ = makeDefElem($1, $2, @1);
}
+   | FORMAT_LA copy_generic_opt_arg
+   {
+   $$ = makeDefElem("format", $2, @1);
+   }
;

I think it's not necessary. "format" option is already handled in
copy_generic_opt_elem.

---
+/* need delimiter to start next json array element */
+static bool json_row_delim_needed = false;

I think it's cleaner to include json_row_delim_needed into CopyToStateData.
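
Something along these lines, say (a sketch; CopyToStateData is defined in
src/backend/commands/copyto.c):

    typedef struct CopyToStateData
    {
        /* ... existing fields ... */
        bool        json_row_delim_needed;  /* delimiter needed before next row? */
    } CopyToStateData;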

---
Splitting the patch into two patches: add json format and add
force_array option would make reviews easy.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

2024-01-15 Thread jian he
On Tue, Jan 16, 2024 at 11:46 AM jian he  wrote:
>
>
> I think the reason may be related to the function copy_dest_startup.
I was wrong about this sentence.

In the function CopyOneRowTo, change the `if (!cstate->opts.json_mode)`
else branch to the following:

else
{
    Datum       rowdata;
    StringInfo  result;

    if (slot->tts_tupleDescriptor->natts == 1)
    {
        /* Flat-copy the attribute array */
        memcpy(TupleDescAttr(slot->tts_tupleDescriptor, 0),
               TupleDescAttr(cstate->queryDesc->tupDesc, 0),
               1 * sizeof(FormData_pg_attribute));
    }
    BlessTupleDesc(slot->tts_tupleDescriptor);
    rowdata = ExecFetchSlotHeapTupleDatum(slot);
    result = makeStringInfo();
    composite_to_json(rowdata, result, false);
    if (json_row_delim_needed &&
        cstate->opts.force_array)
    {
        CopySendChar(cstate, ',');
    }
    else if (cstate->opts.force_array)
    {
        /* first row needs no delimiter */
        CopySendChar(cstate, ' ');
        json_row_delim_needed = true;
    }
    CopySendData(cstate, result->data, result->len);
}

All the cases work, but it is more like a hack,
because I cannot fully explain to you why it works.
---
demo


drop function if exists execute_into_test cascade;
NOTICE:  function execute_into_test() does not exist, skipping
DROP FUNCTION
drop type if exists execute_into_test cascade;
NOTICE:  type "execute_into_test" does not exist, skipping
DROP TYPE
create type eitype as (i integer, y integer);
CREATE TYPE
create or replace function execute_into_test() returns eitype as $$
declare
_v eitype;
begin
execute 'select 1,2' into _v;
return _v;
end; $$ language plpgsql;
CREATE FUNCTION

COPY (SELECT 1 from generate_series(1,1) g) TO stdout WITH (format json);
{"?column?":1}
COPY (SELECT g from generate_series(1,1) g) TO stdout WITH (format json);
{"g":1}
COPY (SELECT g,1 from generate_series(1,1) g) TO stdout WITH (format json);
{"g":1,"?column?":1}
COPY (select * from execute_into_test()) TO stdout WITH (format json);
{"i":1,"y":2}
COPY (select * from execute_into_test() sub) TO stdout WITH (format json);
{"i":1,"y":2}
COPY (select sub from execute_into_test() sub) TO stdout WITH (format json);
{"sub":{"i":1,"y":2}}
COPY (select sub.i from execute_into_test() sub) TO stdout WITH (format json);
{"i":1}
COPY (select sub.y from execute_into_test() sub) TO stdout WITH (format json);
{"y":2}
COPY (VALUES (1), (2)) TO stdout WITH (format json);
{"column1":1}
{"column1":2}
 COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);
{"?column?":1}
{"?column?":2}




Re: Emitting JSON to file using COPY TO

2024-01-15 Thread jian he
On Tue, Jan 9, 2024 at 4:40 AM Joe Conway  wrote:
>
> On 1/8/24 14:36, Dean Rasheed wrote:
> > On Thu, 7 Dec 2023 at 01:10, Joe Conway  wrote:
> >>
> >> The attached should fix the CopyOut response to say one column.
> >>
> >
> > Playing around with this, I found a couple of cases that generate an error:
> >
> > COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);
> >
> > COPY (VALUES (1), (2)) TO stdout WITH (format json);
> >
> > both of those generate the following:
> >
> > ERROR:  record type has not been registered
>
>
> Thanks -- will have a look
>
> --
> Joe Conway
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>
>
>

In the function CopyOneRowTo, I tried calling the function BlessTupleDesc again:

+BlessTupleDesc(slot->tts_tupleDescriptor);
rowdata = ExecFetchSlotHeapTupleDatum(slot);

Please check the attachments (one test.sql file, one patch, one "bless twice" patch).

Now the error cases are gone and fewer cases return an error,
but the new result is not the expected one.

`COPY (SELECT g from generate_series(1,1) g) TO stdout WITH (format json);`
returns
{"":1}
The expected result would be `{"g":1}`.

I think the reason may be related to the function copy_dest_startup.


test.sql
Description: application/sql


patch.out
Description: Binary data


patch_bless_twice.out
Description: Binary data


Re: Emitting JSON to file using COPY TO

2024-01-08 Thread Joe Conway

On 1/8/24 14:36, Dean Rasheed wrote:
> On Thu, 7 Dec 2023 at 01:10, Joe Conway  wrote:
>>
>> The attached should fix the CopyOut response to say one column.
>
> Playing around with this, I found a couple of cases that generate an error:
>
> COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);
>
> COPY (VALUES (1), (2)) TO stdout WITH (format json);
>
> both of those generate the following:
>
> ERROR:  record type has not been registered

Thanks -- will have a look

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2024-01-08 Thread Dean Rasheed
On Thu, 7 Dec 2023 at 01:10, Joe Conway  wrote:
>
> The attached should fix the CopyOut response to say one column.
>

Playing around with this, I found a couple of cases that generate an error:

COPY (SELECT 1 UNION ALL SELECT 2) TO stdout WITH (format json);

COPY (VALUES (1), (2)) TO stdout WITH (format json);

both of those generate the following:

ERROR:  record type has not been registered

Regards,
Dean




Re: Emitting JSON to file using COPY TO

2023-12-09 Thread Hannu Krosing

On Sat, Dec 2, 2023 at 4:11 PM Tom Lane  wrote:
>
> Joe Conway  writes:
> >> I noticed that, with the PoC patch, "json" is the only format that must be
> >> quoted.  Without quotes, I see a syntax error.


In the longer term we should move any specific COPY flag names and values
out of the grammar, and their checking into the parts that actually
implement whatever the flag is influencing.

Similar to what we do with OPTIONS in all levels of FDW  definitions
(WRAPPER itself, SERVER, USER MAPPING, FOREIGN TABLE)

[*] https://www.postgresql.org/docs/current/sql-createforeigndatawrapper.html

> >> I'm assuming there's a
> >> conflict with another json-related rule somewhere in gram.y, but I haven't
> >> tracked down exactly which one is causing it.
>
> While I've not looked too closely, I suspect this might be due to the
> FORMAT_LA hack in base_yylex:
>
> /* Replace FORMAT by FORMAT_LA if it's followed by JSON */
> switch (next_token)
> {
> case JSON:
> cur_token = FORMAT_LA;
> break;
> }

My hope is that turning the WITH into a fully independent part with no
grammar-defined keys or values would also solve the issue of quoting
"json".

For backwards compatibility we may even go the route of keeping the
WITH as is, but adding OPTIONS, which can take any values at the grammar
level.

I shared my "Pluggable Copy " talk slides from Berlin '22 in another thread

--
Hannu




Re: Emitting JSON to file using COPY TO

2023-12-08 Thread Joe Conway

On 12/8/23 14:45, Daniel Verite wrote:
> Joe Conway wrote:
>
>> copyto_json.007.diff
>
> When the source has json fields with non-significant line feeds, the COPY
> output has these line feeds too, which makes the output incompatible
> with rule #2 at https://jsonlines.org  ("2. Each Line is a Valid JSON
> Value").
>
> create table j(f json);
>
> insert into j values('{"a":1,
> "b":2
> }');
>
> copy j to stdout (format json);
>
> Result:
> {"f":{"a":1,
> "b":2
> }}
>
> Is that expected? copy.sgml in 007 doesn't describe the output
> in terms of lines so it's hard to tell from the doc.

The patch as-is just does the equivalent of row_to_json():
8<
select row_to_json(j) from j;
 row_to_json
--
 {"f":{"a":1,+
 "b":2   +
 }}
(1 row)
8<

So yeah, that is how it works today. I will take a look at what it would
take to fix it.



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-08 Thread Daniel Verite
Dave Cramer wrote:

> > This argument for leaving 3 as the column count makes sense to me.  I
> > agree this content is not meant to facilitate interpreting the contents at
> > a protocol level.
> >
> 
> I'd disagree. From my POV if the data comes back as a JSON Array this is
> one object and this should be reflected in the column count.

The doc says this:
"Int16
The number of columns in the data to be copied (denoted N below)."

and this formulation is repeated in PQnfields() for libpq:

"PQnfields
Returns the number of columns (fields) to be copied."

How should we interpret that sentence?
"to be copied" from what, into what, and by what means?

A plausible interpretation is "to be copied from the source data
into the COPY stream, by the backend".  So the number of columns
to be copied likely refers to the columns of the dataset, not the
"in-transit form" that is text or csv or json.

The interpretation you're proposing also makes sense, that it's just
one json column per row, or even a single-row single-column for the
entire dataset in the force_array case, but then the question is why
isn't that number of columns always 1 for the original "text" format,
since each row is represented in the stream as a single long piece of
text?


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Emitting JSON to file using COPY TO

2023-12-08 Thread Daniel Verite
Joe Conway wrote:

> copyto_json.007.diff

When the source has json fields with non-significant line feeds, the COPY
output has these line feeds too, which makes the output incompatible
with rule #2 at https://jsonlines.org  ("2. Each Line is a Valid JSON
Value").

create table j(f json);

insert into j values('{"a":1,
"b":2
}');

copy j to stdout (format json);

Result:
{"f":{"a":1,
"b":2
}}

Is that expected? copy.sgml in 007 doesn't describe the output
in terms of lines so it's hard to tell from the doc.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Emitting JSON to file using COPY TO

2023-12-08 Thread Dave Cramer
On Thu, 7 Dec 2023 at 08:47, David G. Johnston 
wrote:

> On Thursday, December 7, 2023, Daniel Verite 
> wrote:
>
>> Joe Conway wrote:
>>
>> > The attached should fix the CopyOut response to say one column. I.e. it
>> > ought to look something like:
>>
>> Spending more time with the doc I came to the opinion that in this bit
>> of the protocol, in CopyOutResponse (B)
>> ...
>> Int16
>> The number of columns in the data to be copied (denoted N below).
>> ...
>>
>> this number must be the number of columns in the source.
>> That is, for COPY table(a,b,c) the number is 3, independently
>> of whether the result is formatted in text, csv, json or binary.
>>
>> I think that changing it for json can reasonably be interpreted
>> as a protocol break and we should not do it.
>>
>> The fact that this value does not help parsing the CopyData
>> messages that come next is not a new issue. A reader that
>> doesn't know the field separator and whether it's text or csv
>> cannot parse these messages into fields anyway.
>> But just knowing how many columns there are in the original
>> data might be useful by itself and we don't want to break that.
>
>
> This argument for leaving 3 as the column count makes sense to me.  I
> agree this content is not meant to facilitate interpreting the contents at
> a protocol level.
>

I'd disagree. From my POV if the data comes back as a JSON Array this is
one object and this should be reflected in the column count.

>
>
>>
>>
>> The other question for me is, in the CopyData message, this
>> bit:
>> " Messages sent from the backend will always correspond to single data
>> rows"
>>
>> ISTM that considering that the "[" starting the json array is a
>> "data row" is a stretch.
>> That might be interpreted as a protocol break, depending
>> on how strict the interpretation is.
>>
>
Well technically it is a single row if you send an array.

Regardless, per Euler's comment above, I expect that the JSON lines format is
going to be the preferred format, as the client doesn't have to wait for the
entire object before starting to parse.

Dave

>


Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Joe Conway

On 12/7/23 09:11, David G. Johnston wrote:
> Those are all the same breakage though - if truly interpreted as data
> rows the protocol is basically written such that the array format is not
> supportable and only the lines format can be used.  Hence my “format 0
> doesn’t work” comment for array output and we should explicitly add
> format 2 where we explicitly decouple lines of output from rows of
> data.  That said, it would seem in practice format 0 already decouples
> them and so the current choice of the brackets on their own lines is
> acceptable.
>
> I’d prefer to keep them on their own line.

WFM ¯\_(ツ)_/¯

I am merely responding with options to the many people opining on the
thread.

> I also don’t know why you introduced another level of object nesting
> here.  That seems quite undesirable.

I didn't add anything. It is an artifact of the particular query I wrote
in the copy to statement (I did "select ss from ss" instead of "select *
from ss"), mea culpa.


This is what the latest patch, as written today, outputs:
8<--
copy
(select 1, g.i from generate_series(1, 3) g(i))
to stdout (format json, force_array);
[
 {"?column?":1,"i":1}
,{"?column?":1,"i":2}
,{"?column?":1,"i":3}
]
8<--

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-07 Thread David G. Johnston
On Thursday, December 7, 2023, Joe Conway  wrote:

> On 12/7/23 08:35, Daniel Verite wrote:
>
>> Joe Conway wrote:
>>
>> The attached should fix the CopyOut response to say one column. I.e. it
>>> ought to look something like:
>>>
>>
>> Spending more time with the doc I came to the opinion that in this bit
>> of the protocol, in CopyOutResponse (B)
>> ...
>> Int16
>> The number of columns in the data to be copied (denoted N below).
>> ...
>>
>> this number must be the number of columns in the source.
>> That is, for COPY table(a,b,c) the number is 3, independently
>> of whether the result is formatted in text, csv, json or binary.
>>
>> I think that changing it for json can reasonably be interpreted
>> as a protocol break and we should not do it.
>>
>> The fact that this value does not help parsing the CopyData
>> messages that come next is not a new issue. A reader that
>> doesn't know the field separator and whether it's text or csv
>> cannot parse these messages into fields anyway.
>> But just knowing how many columns there are in the original
>> data might be useful by itself and we don't want to break that.
>>
>
> Ok, that sounds reasonable to me -- I will revert that change.
>
> The other question for me is, in the CopyData message, this
>> bit:
>> " Messages sent from the backend will always correspond to single data
>> rows"
>>
>> ISTM that considering that the "[" starting the json array is a
>> "data row" is a stretch.
>> That might be interpreted as a protocol break, depending
>> on how strict the interpretation is.
>>
>
> If we really think that is a problem I can see about changing it to this
> format for json array:
>
> 8<--
> copy
> (
>   with ss(f1, f2) as
>   (
> select 1, g.i from generate_series(1, 3) g(i)
>   )
>   select ss from ss
> ) to stdout (format json, force_array);
> [{"ss":{"f1":1,"f2":1}}
> ,{"ss":{"f1":1,"f2":2}}
> ,{"ss":{"f1":1,"f2":3}}]
> 8<--
>
> Is this acceptable to everyone?
>
> Or maybe this is preferred?
> 8<--
> [{"ss":{"f1":1,"f2":1}},
>  {"ss":{"f1":1,"f2":2}},
>  {"ss":{"f1":1,"f2":3}}]
> 8<--
>
> Or as long as we are painting the shed, maybe this?
> 8<--
> [{"ss":{"f1":1,"f2":1}},
> {"ss":{"f1":1,"f2":2}},
> {"ss":{"f1":1,"f2":3}}]
> 8<--
>

Those are all the same breakage though - if truly interpreted as data rows
the protocol is basically written such that the array format is not
supportable and only the lines format can be used.  Hence my “format 0
doesn’t work” comment for array output and we should explicitly add format
2 where we explicitly decouple lines of output from rows of data.  That
said, it would seem in practice format 0 already decouples them and so the
current choice of the brackets on their own lines is acceptable.

I’d prefer to keep them on their own line.

I also don’t know why you introduced another level of object nesting here.
That seems quite undesirable.

David J.


Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Joe Conway

On 12/7/23 08:52, Joe Conway wrote:
> Or maybe this is preferred?
> 8<--
> [{"ss":{"f1":1,"f2":1}},
>   {"ss":{"f1":1,"f2":2}},
>   {"ss":{"f1":1,"f2":3}}]
> 8<--

I don't know why my mail client keeps adding extra spaces, but the
intention here is a single space in front of rows 2 and 3 in order to
line the json objects up at column 2.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Joe Conway

On 12/7/23 08:35, Daniel Verite wrote:
> Joe Conway wrote:
>
>> The attached should fix the CopyOut response to say one column. I.e. it
>> ought to look something like:
>
> Spending more time with the doc I came to the opinion that in this bit
> of the protocol, in CopyOutResponse (B)
> ...
> Int16
> The number of columns in the data to be copied (denoted N below).
> ...
>
> this number must be the number of columns in the source.
> That is, for COPY table(a,b,c) the number is 3, independently
> of whether the result is formatted in text, csv, json or binary.
>
> I think that changing it for json can reasonably be interpreted
> as a protocol break and we should not do it.
>
> The fact that this value does not help parsing the CopyData
> messages that come next is not a new issue. A reader that
> doesn't know the field separator and whether it's text or csv
> cannot parse these messages into fields anyway.
> But just knowing how many columns there are in the original
> data might be useful by itself and we don't want to break that.

Ok, that sounds reasonable to me -- I will revert that change.

> The other question for me is, in the CopyData message, this
> bit:
> " Messages sent from the backend will always correspond to single data rows"
>
> ISTM that considering that the "[" starting the json array is a
> "data row" is a stretch.
> That might be interpreted as a protocol break, depending
> on how strict the interpretation is.

If we really think that is a problem I can see about changing it to this
format for json array:

8<--
copy
(
  with ss(f1, f2) as
  (
    select 1, g.i from generate_series(1, 3) g(i)
  )
  select ss from ss
) to stdout (format json, force_array);
[{"ss":{"f1":1,"f2":1}}
,{"ss":{"f1":1,"f2":2}}
,{"ss":{"f1":1,"f2":3}}]
8<--

Is this acceptable to everyone?

Or maybe this is preferred?
8<--
[{"ss":{"f1":1,"f2":1}},
 {"ss":{"f1":1,"f2":2}},
 {"ss":{"f1":1,"f2":3}}]
8<--

Or as long as we are painting the shed, maybe this?
8<--
[{"ss":{"f1":1,"f2":1}},
{"ss":{"f1":1,"f2":2}},
{"ss":{"f1":1,"f2":3}}]
8<--

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-07 Thread David G. Johnston
On Thursday, December 7, 2023, Daniel Verite 
wrote:

> Joe Conway wrote:
>
> > The attached should fix the CopyOut response to say one column. I.e. it
> > ought to look something like:
>
> Spending more time with the doc I came to the opinion that in this bit
> of the protocol, in CopyOutResponse (B)
> ...
> Int16
> The number of columns in the data to be copied (denoted N below).
> ...
>
> this number must be the number of columns in the source.
> That is, for COPY table(a,b,c) the number is 3, independently
> of whether the result is formatted in text, csv, json or binary.
>
> I think that changing it for json can reasonably be interpreted
> as a protocol break and we should not do it.
>
> The fact that this value does not help parsing the CopyData
> messages that come next is not a new issue. A reader that
> doesn't know the field separator and whether it's text or csv
> cannot parse these messages into fields anyway.
> But just knowing how many columns there are in the original
> data might be useful by itself and we don't want to break that.


This argument for leaving 3 as the column count makes sense to me.  I agree
this content is not meant to facilitate interpreting the contents at a
protocol level.


>
>
> The other question for me is, in the CopyData message, this
> bit:
> " Messages sent from the backend will always correspond to single data
> rows"
>
> ISTM that considering that the "[" starting the json array is a
> "data row" is a stretch.
> That might be interpreted as a protocol break, depending
> on how strict the interpretation is.
>

We already effectively interpret this as “one content line per copydata
message” in the csv text with header line case.  I’d probably reword it to
state that explicitly and then we again don’t have to worry about the
protocol caring about any data semantics of the underlying content, only
physical semantics.

David J.


Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Daniel Verite
Joe Conway wrote:

> The attached should fix the CopyOut response to say one column. I.e. it 
> ought to look something like:

Spending more time with the doc I came to the opinion that in this bit
of the protocol, in CopyOutResponse (B)
...
Int16
The number of columns in the data to be copied (denoted N below).
...

this number must be the number of columns in the source.
That is, for COPY table(a,b,c) the number is 3, independently
of whether the result is formatted in text, csv, json or binary.

I think that changing it for json can reasonably be interpreted
as a protocol break and we should not do it.

The fact that this value does not help parsing the CopyData
messages that come next is not a new issue. A reader that
doesn't know the field separator and whether it's text or csv
cannot parse these messages into fields anyway.
But just knowing how many columns there are in the original
data might be useful by itself and we don't want to break that.


The other question for me is, in the CopyData message, this
bit:
" Messages sent from the backend will always correspond to single data rows"

ISTM that considering that the "[" starting the json array is a
"data row" is a stretch.
That might be interpreted as a protocol break, depending
on how strict the interpretation is.



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Andrew Dunstan


On 2023-12-06 We 17:56, David G. Johnston wrote:
> On Wed, Dec 6, 2023 at 3:38 PM Joe Conway  wrote:
>
>> So the questions are:
>> 1. Do those two formats work for the initial implementation?
>
> Yes.  We provide a stream-oriented format and one atomic-import format.
>
>> 2. Is the default correct or should it be switched
>>    e.g. rather than specifying FORCE_ARRAY to get an
>>    array, something like FORCE_NO_ARRAY to get JSON lines
>>    and the JSON array is default?
>
> No default?
>
> Require an explicit sub-format when the main format is JSON.
>
> JSON_OBJECT_ROWS
> JSON_ARRAY_OF_OBJECTS
>
> For a future compact array-structured-composites sub-format:
> JSON_ARRAY_OF_ARRAYS
> JSON_ARRAY_ROWS

No default seems unlike the way we treat other COPY options. I'm not
terribly fussed about which format to have as the default, but I think
we should have one.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Emitting JSON to file using COPY TO

2023-12-07 Thread Joe Conway

On 12/6/23 21:56, Nathan Bossart wrote:
> On Wed, Dec 06, 2023 at 03:20:46PM -0500, Tom Lane wrote:
>> If Nathan's perf results hold up elsewhere, it seems like some
>> micro-optimization around the text-pushing (appendStringInfoString)
>> might be more useful than caching.  The 7% spent in cache lookups
>> could be worth going after later, but it's not the top of the list.
>
> Hah, it turns out my benchmark of 110M integers really stresses the
> JSONTYPE_NUMERIC path in datum_to_json_internal().  That particular path
> calls strlen() twice: once for IsValidJsonNumber(), and once in
> appendStringInfoString().  If I save the result from IsValidJsonNumber()
> and give it to appendBinaryStringInfo() instead, the COPY goes ~8% faster.
> It's probably worth giving datum_to_json_internal() a closer look in a new
> thread.

Yep, after looking through that code I was going to make the point that
your 110M-integer test was over-indexing on that one type. I am sure there
are other micro-optimizations to be made here, but I also think that it
is outside the scope of the COPY TO JSON patch.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Nathan Bossart
On Wed, Dec 06, 2023 at 03:20:46PM -0500, Tom Lane wrote:
> If Nathan's perf results hold up elsewhere, it seems like some
> micro-optimization around the text-pushing (appendStringInfoString)
> might be more useful than caching.  The 7% spent in cache lookups
> could be worth going after later, but it's not the top of the list.

Hah, it turns out my benchmark of 110M integers really stresses the
JSONTYPE_NUMERIC path in datum_to_json_internal().  That particular path
calls strlen() twice: once for IsValidJsonNumber(), and once in
appendStringInfoString().  If I save the result from IsValidJsonNumber()
and give it to appendBinaryStringInfo() instead, the COPY goes ~8% faster.
It's probably worth giving datum_to_json_internal() a closer look in a new
thread.

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 71ae53ff97..1951e93d9d 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -180,6 +180,7 @@ datum_to_json_internal(Datum val, bool is_null, StringInfo result,
 {
 	char	   *outputstr;
 	text	   *jsontext;
+	int			len;
 
 	check_stack_depth();
 
@@ -223,8 +224,8 @@ datum_to_json_internal(Datum val, bool is_null, StringInfo result,
 			 * Don't call escape_json for a non-key if it's a valid JSON
 			 * number.
 			 */
-			if (!key_scalar && IsValidJsonNumber(outputstr, strlen(outputstr)))
-				appendStringInfoString(result, outputstr);
+			if (!key_scalar && IsValidJsonNumber(outputstr, (len = strlen(outputstr))))
+				appendBinaryStringInfo(result, outputstr, len);
 			else
 				escape_json(result, outputstr);
 			pfree(outputstr);

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Euler Taveira
On Wed, Dec 6, 2023, at 3:59 PM, Daniel Verite wrote:
> The first Copy data message with contents "5b0a" does not qualify
> as a row of data with 3 columns as advertised in the CopyOut
> message. Isn't that a problem?
> 
> At least the json non-ARRAY case ("json lines") doesn't have
> this issue, since every CopyData message corresponds effectively
> to a row in the table.

Moreover, if your interface wants to process the COPY data stream while
receiving it, you cannot use the "json array" format, because the rows
received so far do not form a valid JSON document; a JSON parser cannot run
until you have received the whole data set. (wal2json format 1 has this
disadvantage. Format 2 was born to provide a better alternative -- each row
is a valid JSON document.) I'm not saying that "json array" is not useful,
but for large data sets it is less useful.
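
To make the streaming distinction concrete, below is a minimal libpq
sketch of a consumer (the connection string and query are placeholders,
and FORMAT JSON assumes the patch under discussion is applied). In the
default "json lines" mode every buffer returned by PQgetCopyData is one
complete JSON document, so each row can be parsed the moment it arrives;
in the array mode a consumer would have to buffer everything up to the
closing "]" before a standard JSON parser could run.

8<---
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");	/* placeholder */
	PGresult   *res;
	char	   *buf;
	int			len;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	res = PQexec(conn,
				 "COPY (SELECT x.i, 'val' || x.i AS v"
				 " FROM generate_series(1, 3) x(i))"
				 " TO STDOUT WITH (FORMAT JSON)");
	if (PQresultStatus(res) != PGRES_COPY_OUT)
	{
		fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
		return 1;
	}
	PQclear(res);

	/* one CopyData message per iteration: in lines mode, one JSON doc per row */
	while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
	{
		fwrite(buf, 1, len, stdout);	/* hand straight to a JSON parser */
		PQfreemem(buf);
	}

	PQclear(PQgetResult(conn));	/* absorb the final command status */
	PQfinish(conn);
	return 0;
}
8<---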


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Emitting JSON to file using COPY TO

2023-12-06 Thread David G. Johnston
On Wed, Dec 6, 2023 at 6:14 PM Joe Conway  wrote:

>
> > But the point that we should introduce a 2 still stands.  The new code
> > would mean: use text output functions but that there is no inherent
> > tabular structure in the underlying contents.  Instead the copy format
> > was JSON and the output layout is dependent upon the json options in the
> > copy command and that there really shouldn't be any attempt to turn the
> > contents directly into a tabular data structure like you presently do
> > with the CSV data under format 0.  Ignore the column count and column
> > formats as they are fixed or non-existent.
>
> I think that amounts to a protocol change, which we tend to avoid at all
> costs.
>
>
I wasn't sure on that point but figured it might be the case.  It is a
value change, not a structural one, which seems like the kind of
modification any living system might allow and be expected to have.  But I
also don't see any known problem with the current change of content
semantics without the format identification change.  Most of the relevant
context ends up out-of-band in the COPY command itself.

David J.


Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 20:09, David G. Johnston wrote:
On Wed, Dec 6, 2023 at 5:57 PM Joe Conway  wrote:


On 12/6/23 19:39, David G. Johnston wrote:
 > On Wed, Dec 6, 2023 at 4:45 PM Joe Conway <m...@joeconway.com> wrote:

 > But I still cannot shake the belief that using a format code of 1 -
 > which really could be interpreted as meaning "textual csv" in
practice -
 > for this JSON output is unwise and we should introduce a new integer
 > value for the new fundamental output format.

No, I am pretty sure you still have that wrong. The "1" means binary
mode


Ok.  I made the same typo twice, I did mean to write 0 instead of 1.


Fair enough.

But the point that we should introduce a 2 still stands.  The new code 
would mean: use text output functions but that there is no inherent 
tabular structure in the underlying contents.  Instead the copy format 
was JSON and the output layout is dependent upon the json options in the 
copy command and that there really shouldn't be any attempt to turn the 
contents directly into a tabular data structure like you presently do 
with the CSV data under format 0.  Ignore the column count and column 
formats as they are fixed or non-existent.


I think that amounts to a protocol change, which we tend to avoid at all 
costs.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 18:09, Joe Conway wrote:

On 12/6/23 14:47, Joe Conway wrote:

On 12/6/23 13:59, Daniel Verite wrote:

Andrew Dunstan wrote:

IMNSHO, we should produce either a single JSON 
document (the ARRAY case) or a series of JSON documents, one per row 
(the LINES case).


"COPY Operations" in the doc says:

" The backend sends a CopyOutResponse message to the frontend, followed
by zero or more CopyData messages (always one per row), followed by
CopyDone".

In the ARRAY case, the first messages with the copyjsontest
regression test look like this (tshark output):

PostgreSQL
 Type: CopyOut response
 Length: 13
 Format: Text (0)
 Columns: 3
Format: Text (0)
PostgreSQL
 Type: Copy data
 Length: 6
 Copy data: 5b0a
PostgreSQL
 Type: Copy data
 Length: 76
 Copy data:
207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…

The first Copy data message with contents "5b0a" does not qualify
as a row of data with 3 columns as advertised in the CopyOut
message. Isn't that a problem?



Is it a real problem, or just a bit of documentation change that I missed?

Anything receiving this and looking for a json array should know how to
assemble the data correctly despite the extra CopyData messages.


Hmm, maybe the real problem here is that Columns do not equal "3" for
the json mode case -- that should really say "1" I think, because the
row is not represented as 3 columns but rather 1 json object.

Does that sound correct?

Assuming yes, there is still maybe an issue that there are two more
"rows" than actual output rows (the "[" and the "]"), but maybe those
are less likely to cause some hazard?



The attached should fix the CopyOut response to say one column. I.e. it 
ought to look something like:


PostgreSQL
 Type: CopyOut response
 Length: 13
 Format: Text (0)
 Columns: 1
 Format: Text (0)
PostgreSQL
 Type: Copy data
 Length: 6
 Copy data: 5b0a
PostgreSQL
 Type: Copy data
 Length: 76
 Copy data: [...]


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..8915fb3 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,49 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 207,220 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 378,396 
   
  
 
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of square brackets as array decorations at the beginning
+   and end of output, and commas between the rows. It is allowed only in
+   COPY TO, and only when using
+   JSON format. The default is
+   false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..23b570f 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 419,424 
--- 419,425 
  	bool		format_specified = false;
  	bool		freeze_specified = false;
  	bool		header_specified = false;
+ 	bool		force_array_specified = false;
  	ListCell   *option;
  
  	/* Support external use for option sanity checking */
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 444,451 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
*** 540,545 
--- 543,555 
  defel->defname),
  		 parser_errposition(pstate, defel->location)));
  		}
+ 		else if (strcmp(defel->defname, "force_array") == 0)
+ 		{
+ 			if (force_array_specified)
+ errorConflictingDefElem(defel, pstate);
+ 			force_array_specified = true;
+ 			opts_out->force_array = defGetBoolean(defel);
+ 		}
  		else if (strcmp(defel->defname, "convert_selectively") == 0)
  		{
  			/*
*** ProcessCopyOptions(ParseState *pstate,
*** 598,603 
--- 608,625 
  (errcode(ERRCODE_SYNTAX_ERROR),
   errmsg("cannot specify DEFAULT in BINARY mode")));
  
+ 	if (opts_out->json_mode)
+ 	{
+ 		if (is_from)
+ 			ereport(ERROR,
+ 	

Re: Emitting JSON to file using COPY TO

2023-12-06 Thread David G. Johnston
On Wed, Dec 6, 2023 at 5:57 PM Joe Conway  wrote:

> On 12/6/23 19:39, David G. Johnston wrote:
> > On Wed, Dec 6, 2023 at 4:45 PM Joe Conway  wrote:
>
> > But I still cannot shake the belief that using a format code of 1 -
> > which really could be interpreted as meaning "textual csv" in practice -
> > for this JSON output is unwise and we should introduce a new integer
> > value for the new fundamental output format.
>
> No, I am pretty sure you still have that wrong. The "1" means binary
> mode


Ok.  I made the same typo twice, I did mean to write 0 instead of 1.  But
the point that we should introduce a 2 still stands.  The new code would
mean: use text output functions but that there is no inherent tabular
structure in the underlying contents.  Instead the copy format was JSON and
the output layout is dependent upon the json options in the copy command
and that there really shouldn't be any attempt to turn the contents
directly into a tabular data structure like you presently do with the CSV
data under format 0.  Ignore the column count and column formats as they
are fixed or non-existent.

David J.


Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 19:39, David G. Johnston wrote:
On Wed, Dec 6, 2023 at 4:45 PM Joe Conway  wrote:



" The backend sends a CopyOutResponse message to the frontend, followed
     by zero or more CopyData messages (always one per row), followed by
     CopyDone"

probably "always one per row" would be changed to note that json array
format outputs two extra rows for the start/end bracket.


Fair, I was ascribing much more semantic meaning to this than it wants.

I don't see any real requirement, given the lack of semantics, to 
mention JSON at all.  It is one CopyData per row, regardless of the 
contents.  We don't delineate between the header and non-header data in 
CSV.  It isn't a protocol concern.


good point

But I still cannot shake the belief that using a format code of 1 - 
which really could be interpreted as meaning "textual csv" in practice - 
for this JSON output is unwise and we should introduce a new integer 
value for the new fundamental output format.


No, I am pretty sure you still have that wrong. The "1" means binary 
mode. As in

8<--
FORMAT

Selects the data format to be read or written: text, csv (Comma 
Separated Values), or binary. The default is text.

8<--

That is completely separate from text and csv. It literally means to use 
the binary output functions instead of the usual ones:


8<--
if (cstate->opts.binary)
	getTypeBinaryOutputInfo(attr->atttypid,
							&out_func_oid,
							&isvarlena);
else
	getTypeOutputInfo(attr->atttypid,
					  &out_func_oid,
					  &isvarlena);
8<--

Both "text" and "csv" modes use non-binary output formats. I believe 
the JSON output format is also non-binary.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread David G. Johnston
On Wed, Dec 6, 2023 at 4:45 PM Joe Conway  wrote:

>
> " The backend sends a CopyOutResponse message to the frontend, followed
> by zero or more CopyData messages (always one per row), followed by
> CopyDone"
>
> probably "always one per row" would be changed to note that json array
> format outputs two extra rows for the start/end bracket.
>

Fair, I was ascribing much more semantic meaning to this than it wants.

I don't see any real requirement, given the lack of semantics, to mention
JSON at all.  It is one CopyData per row, regardless of the contents.  We
don't delineate between the header and non-header data in CSV.  It isn't a
protocol concern.

But I still cannot shake the belief that using a format code of 1 - which
really could be interpreted as meaning "textual csv" in practice - for this
JSON output is unwise and we should introduce a new integer value for the
new fundamental output format.

David J.


Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 18:38, David G. Johnston wrote:
On Wed, Dec 6, 2023 at 4:28 PM David G. Johnston <david.g.johns...@gmail.com> wrote:


On Wed, Dec 6, 2023 at 4:09 PM Joe Conway <m...@joeconway.com> wrote:

On 12/6/23 14:47, Joe Conway wrote:
 > On 12/6/23 13:59, Daniel Verite wrote:
 >>      Andrew Dunstan wrote:
 >>
 >>> IMNSHO, we should produce either a single JSON
 >>> document (the ARRAY case) or a series of JSON documents,
one per row
 >>> (the LINES case).
 >>
 >> "COPY Operations" in the doc says:
 >>
 >> " The backend sends a CopyOutResponse message to the
frontend, followed
 >>     by zero or more CopyData messages (always one per row),
followed by
 >>     CopyDone".
 >>
 >> In the ARRAY case, the first messages with the copyjsontest
 >> regression test look like this (tshark output):
 >>
 >> PostgreSQL
 >>      Type: CopyOut response
 >>      Length: 13
 >>      Format: Text (0)
 >>      Columns: 3
 >>      Format: Text (0)

 > Anything receiving this and looking for a json array should
know how to
 > assemble the data correctly despite the extra CopyData messages.

Hmm, maybe the real problem here is that Columns do not equal
"3" for
the json mode case -- that should really say "1" I think,
because the
row is not represented as 3 columns but rather 1 json object.

Does that sound correct?

Assuming yes, there is still maybe an issue that there are two more
"rows" than actual output rows (the "[" and the "]"), but maybe those
are less likely to cause some hazard?


What is the limitation, if any, of introducing new type codes for
these.  n = 2..N for the different variants?  Or even -1 for "raw
text"?  And document that columns and structural rows need to be
determined out-of-band.  Continuing to use 1 (text) for this non-csv
data seems like a hack even if we can technically make it function. 
The semantics, especially for the array case, are completely
discarded or wrong.

Also, it seems like this answer would be easier to make if we implement 
COPY FROM now: how is the server supposed to deal with decomposing 
this data into tables without accurate type information?  I don't see 
implementing only half of the feature being a good idea.  I've had much 
more desire for FROM compared to TO personally.


Several people have weighed in on the side of getting COPY TO done by 
itself first. Given how long this discussion has already become for a 
relatively small and simple feature, I am a big fan of not expanding the 
scope now.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 18:28, David G. Johnston wrote:
On Wed, Dec 6, 2023 at 4:09 PM Joe Conway  wrote:


On 12/6/23 14:47, Joe Conway wrote:
 > On 12/6/23 13:59, Daniel Verite wrote:
 >>      Andrew Dunstan wrote:
 >>
 >>> IMNSHO, we should produce either a single JSON
 >>> document (the ARRAY case) or a series of JSON documents, one
per row
 >>> (the LINES case).
 >>
 >> "COPY Operations" in the doc says:
 >>
 >> " The backend sends a CopyOutResponse message to the frontend,
followed
 >>     by zero or more CopyData messages (always one per row),
followed by
 >>     CopyDone".
 >>
 >> In the ARRAY case, the first messages with the copyjsontest
 >> regression test look like this (tshark output):
 >>
 >> PostgreSQL
 >>      Type: CopyOut response
 >>      Length: 13
 >>      Format: Text (0)
 >>      Columns: 3
 >>      Format: Text (0)

 > Anything receiving this and looking for a json array should know
how to
 > assemble the data correctly despite the extra CopyData messages.

Hmm, maybe the real problem here is that Columns do not equal "3" for
the json mode case -- that should really say "1" I think, because the
row is not represented as 3 columns but rather 1 json object.

Does that sound correct?

Assuming yes, there is still maybe an issue that there are two more
"rows" than actual output rows (the "[" and the "]"), but maybe those
are less likely to cause some hazard?


What is the limitation, if any, of introducing new type codes for 
these.  n = 2..N for the different variants?  Or even -1 for "raw 
text"?  And document that columns and structural rows need to be 
determined out-of-band.  Continuing to use 1 (text) for this non-csv 
data seems like a hack even if we can technically make it function.  The 
semantics, especially for the array case, are completely discarded or wrong.


I am not following you here. SendCopyBegin looks like this currently:

8<
static void
SendCopyBegin(CopyToState cstate)
{
	StringInfoData buf;
	int			natts = list_length(cstate->attnumlist);
	int16		format = (cstate->opts.binary ? 1 : 0);
	int			i;

	pq_beginmessage(&buf, PqMsg_CopyOutResponse);
	pq_sendbyte(&buf, format);	/* overall format */
	pq_sendint16(&buf, natts);
	for (i = 0; i < natts; i++)
		pq_sendint16(&buf, format); /* per-column formats */
	pq_endmessage(&buf);
	cstate->copy_dest = COPY_FRONTEND;
}
8<

The "1" says whether we are in binary mode or not. JSON mode will never 
send binary in the current implementation, at least. And it always 
aggregates all the columns as one json object. So the correct answer is 
(I think):

8<
*** SendCopyBegin(CopyToState cstate)
*** 146,154 ****
  
  	pq_beginmessage(&buf, PqMsg_CopyOutResponse);
  	pq_sendbyte(&buf, format);	/* overall format */
! 	pq_sendint16(&buf, natts);
! 	for (i = 0; i < natts; i++)
! 		pq_sendint16(&buf, format); /* per-column formats */
  	pq_endmessage(&buf);
  	cstate->copy_dest = COPY_FRONTEND;
  }
--- 150,169 ----
  
  	pq_beginmessage(&buf, PqMsg_CopyOutResponse);
  	pq_sendbyte(&buf, format);	/* overall format */
! 	if (!cstate->opts.json_mode)
! 	{
! 		pq_sendint16(&buf, natts);
! 		for (i = 0; i < natts; i++)
! 			pq_sendint16(&buf, format); /* per-column formats */
! 	}
! 	else
! 	{
! 		/*
! 		 * JSON mode is always one non-binary column
! 		 */
! 		pq_sendint16(&buf, 1);
! 		pq_sendint16(&buf, 0);
! 	}
  	pq_endmessage(&buf);
  	cstate->copy_dest = COPY_FRONTEND;
  }
8<

That still leaves the need to fix the documentation:

" The backend sends a CopyOutResponse message to the frontend, followed
   by zero or more CopyData messages (always one per row), followed by
   CopyDone"

probably "always one per row" would be changed to note that json array 
format outputs two extra rows for the start/end bracket.


In fact, as written the patch does this:
8<
COPY (SELECT x.i, 'val' || x.i as v FROM
  generate_series(1, 3) x(i) WHERE false)
TO STDOUT WITH (FORMAT JSON, FORCE_ARRAY);
[
]
8<

Not sure if that is a problem or not.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread David G. Johnston
On Wed, Dec 6, 2023 at 4:28 PM David G. Johnston 
wrote:

> On Wed, Dec 6, 2023 at 4:09 PM Joe Conway  wrote:
>
>> On 12/6/23 14:47, Joe Conway wrote:
>> > On 12/6/23 13:59, Daniel Verite wrote:
>> >>  Andrew Dunstan wrote:
>> >>
>> >>> IMNSHO, we should produce either a single JSON
>> >>> document (the ARRAY case) or a series of JSON documents, one per row
>> >>> (the LINES case).
>> >>
>> >> "COPY Operations" in the doc says:
>> >>
>> >> " The backend sends a CopyOutResponse message to the frontend, followed
>> >> by zero or more CopyData messages (always one per row), followed by
>> >> CopyDone".
>> >>
>> >> In the ARRAY case, the first messages with the copyjsontest
>> >> regression test look like this (tshark output):
>> >>
>> >> PostgreSQL
>> >>  Type: CopyOut response
>> >>  Length: 13
>> >>  Format: Text (0)
>> >>  Columns: 3
>> >>  Format: Text (0)
>>
>> > Anything receiving this and looking for a json array should know how to
>> > assemble the data correctly despite the extra CopyData messages.
>>
>> Hmm, maybe the real problem here is that Columns do not equal "3" for
>> the json mode case -- that should really say "1" I think, because the
>> row is not represented as 3 columns but rather 1 json object.
>>
>> Does that sound correct?
>>
>> Assuming yes, there is still maybe an issue that there are two more
>> "rows" than actual output rows (the "[" and the "]"), but maybe those
>> are less likely to cause some hazard?
>>
>>
> What is the limitation, if any, of introducing new type codes for these.
> n = 2..N for the different variants?  Or even -1 for "raw text"?  And
> document that columns and structural rows need to be determined
> out-of-band.  Continuing to use 1 (text) for this non-csv data seems like a
> hack even if we can technically make it function.  The semantics,
> especially for the array case, are completely discarded or wrong.
>
>
Also, it seems like this answer would be easier to make if we implement
COPY FROM now: how is the server supposed to deal with decomposing
this data into tables without accurate type information?  I don't see
implementing only half of the feature being a good idea.  I've had much
more desire for FROM compared to TO personally.

David J.


Re: Emitting JSON to file using COPY TO

2023-12-06 Thread David G. Johnston
On Wed, Dec 6, 2023 at 4:09 PM Joe Conway  wrote:

> On 12/6/23 14:47, Joe Conway wrote:
> > On 12/6/23 13:59, Daniel Verite wrote:
> >>  Andrew Dunstan wrote:
> >>
> >>> IMNSHO, we should produce either a single JSON
> >>> document (the ARRAY case) or a series of JSON documents, one per row
> >>> (the LINES case).
> >>
> >> "COPY Operations" in the doc says:
> >>
> >> " The backend sends a CopyOutResponse message to the frontend, followed
> >> by zero or more CopyData messages (always one per row), followed by
> >> CopyDone".
> >>
> >> In the ARRAY case, the first messages with the copyjsontest
> >> regression test look like this (tshark output):
> >>
> >> PostgreSQL
> >>  Type: CopyOut response
> >>  Length: 13
> >>  Format: Text (0)
> >>  Columns: 3
> >>  Format: Text (0)
>
> > Anything receiving this and looking for a json array should know how to
> > assemble the data correctly despite the extra CopyData messages.
>
> Hmm, maybe the real problem here is that Columns do not equal "3" for
> the json mode case -- that should really say "1" I think, because the
> row is not represented as 3 columns but rather 1 json object.
>
> Does that sound correct?
>
> Assuming yes, there is still maybe an issue that there are two more
> "rows" than actual output rows (the "[" and the "]"), but maybe those
> are less likely to cause some hazard?
>
>
What is the limitation, if any, of introducing new type codes for these.  n
= 2..N for the different variants?  Or even -1 for "raw text"?  And
document that columns and structural rows need to be determined
out-of-band.  Continuing to use 1 (text) for this non-csv data seems like a
hack even if we can technically make it function.  The semantics,
especially for the array case, are completely discarded or wrong.

David J.


Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 14:47, Joe Conway wrote:

On 12/6/23 13:59, Daniel Verite wrote:

Andrew Dunstan wrote:

IMNSHO, we should produce either a single JSON 
document (the ARRAY case) or a series of JSON documents, one per row 
(the LINES case).


"COPY Operations" in the doc says:

" The backend sends a CopyOutResponse message to the frontend, followed
by zero or more CopyData messages (always one per row), followed by
CopyDone".

In the ARRAY case, the first messages with the copyjsontest
regression test look like this (tshark output):

PostgreSQL
 Type: CopyOut response
 Length: 13
 Format: Text (0)
 Columns: 3
Format: Text (0)
PostgreSQL
 Type: Copy data
 Length: 6
 Copy data: 5b0a
PostgreSQL
 Type: Copy data
 Length: 76
 Copy data:
207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…

The first Copy data message with contents "5b0a" does not qualify
as a row of data with 3 columns as advertised in the CopyOut
message. Isn't that a problem?



Is it a real problem, or just a bit of documentation change that I missed?

Anything receiving this and looking for a json array should know how to
assemble the data correctly despite the extra CopyData messages.


Hmm, maybe the real problem here is that Columns do not equal "3" for 
the json mode case -- that should really say "1" I think, because the 
row is not represented as 3 columns but rather 1 json object.


Does that sound correct?

Assuming yes, there is still maybe an issue that there are two more 
"rows" than actual output rows (the "[" and the "]"), but maybe those 
are less likely to cause some hazard?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread David G. Johnston
On Wed, Dec 6, 2023 at 3:38 PM Joe Conway  wrote:

> So the questions are:
> 1. Do those two formats work for the initial implementation?
>

Yes.  We provide a stream-oriented format and one atomic-import format.

> 2. Is the default correct or should it be switched
> e.g. rather than specifying FORCE_ARRAY to get an
> array, something like FORCE_NO_ARRAY to get JSON lines
> and the JSON array is default?
>
>
No default?

Require explicit specification of a sub-format when the main format is JSON.

JSON_OBJECT_ROWS
JSON_ARRAY_OF_OBJECTS

For a future compact array-structured-composites sub-format:
JSON_ARRAY_OF_ARRAYS
JSON_ARRAY_ROWS

David J.


Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 16:42, Sehrope Sarkuni wrote:
On Wed, Dec 6, 2023 at 4:29 PM Joe Conway  wrote:


 > 1. Outputting a top level JSON object without the additional column
 > keys. IIUC, the top level keys are always the column names. A
common use
 > case would be a single json/jsonb column that is already formatted
 > exactly as the user would like for output. Rather than enveloping
it in
 > an object with a dedicated key, it would be nice to be able to
output it
 > directly. This would allow non-object results to be outputted as
well
 > (e.g., lines of JSON arrays, numbers, or strings). Due to how
JSON is
 > structured, I think this would play nice with the JSON lines v.s.
array
 > concept.
 >
 > COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM
 > generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON,
 > SOME_OPTION_TO_NOT_ENVELOPE)
 > {"foo":1}
 > {"foo":2}
 > {"foo":3}

Your example does not match what you describe, or do I misunderstand? I
thought your goal was to eliminate the repeated "foo" from each row...


The "foo" in this case is explicit as I'm adding it when building the 
object. What I was trying to show was not adding an additional object 
wrapper / envelope.


So each row is:

{"foo":1}

Rather than:

{"json_build_object":{"foo":1}}


I am still getting confused ;-)

Let's focus on the current proposed patch with a "minimum required 
feature set".


Right now the default behavior is "JSON lines":
8<---
COPY (SELECT x.i, 'val' || x.i as v FROM
  generate_series(1, 3) x(i))
TO STDOUT WITH (FORMAT JSON);
{"i":1,"v":"val1"}
{"i":2,"v":"val2"}
{"i":3,"v":"val3"}
8<---

and the other, non-default option is "JSON array":
8<---
COPY (SELECT x.i, 'val' || x.i as v FROM
  generate_series(1, 3) x(i))
TO STDOUT WITH (FORMAT JSON, FORCE_ARRAY);
[
 {"i":1,"v":"val1"}
,{"i":2,"v":"val2"}
,{"i":3,"v":"val3"}
]
8<---

So the questions are:
1. Do those two formats work for the initial implementation?
2. Is the default correct or should it be switched
   e.g. rather than specifying FORCE_ARRAY to get an
   array, something like FORCE_NO_ARRAY to get JSON lines
   and the JSON array is default?

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Sehrope Sarkuni
On Wed, Dec 6, 2023 at 4:29 PM Joe Conway  wrote:

> > 1. Outputting a top level JSON object without the additional column
> > keys. IIUC, the top level keys are always the column names. A common use
> > case would be a single json/jsonb column that is already formatted
> > exactly as the user would like for output. Rather than enveloping it in
> > an object with a dedicated key, it would be nice to be able to output it
> > directly. This would allow non-object results to be outputted as well
> > (e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is
> > structured, I think this would play nice with the JSON lines vs. array
> > concept.
> >
> > COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM
> > generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON,
> > SOME_OPTION_TO_NOT_ENVELOPE)
> > {"foo":1}
> > {"foo":2}
> > {"foo":3}
>
> Your example does not match what you describe, or do I misunderstand? I
> thought your goal was to eliminate the repeated "foo" from each row...
>

The "foo" in this case is explicit as I'm adding it when building the
object. What I was trying to show was not adding an additional object
wrapper / envelope.

So each row is:

{"foo":1}

Rather than:

{"json_build_object":{"foo":1}}

If each row has exactly one json / jsonb field, then the user has already
indicated the format for each row.

That same mechanism can be used to remove the "foo" entirely via a
json/jsonb array.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Nathan Bossart
On Wed, Dec 06, 2023 at 03:20:46PM -0500, Tom Lane wrote:
> If Nathan's perf results hold up elsewhere, it seems like some
> micro-optimization around the text-pushing (appendStringInfoString)
> might be more useful than caching.  The 7% spent in cache lookups
> could be worth going after later, but it's not the top of the list.

Agreed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Sehrope Sarkuni
On Wed, Dec 6, 2023 at 4:03 PM Andrew Dunstan  wrote:

> > The output size difference does say that maybe we should pay some
> > attention to the nearby request to not always label every field.
> > Perhaps there should be an option for each row to transform to
> > a JSON array rather than an object?
>
> I doubt it. People who want this are likely to want pretty much what
> this patch is providing, not something they would have to transform in
> order to get it. If they want space-efficient data they won't really be
> wanting JSON. Maybe they want Protocol Buffers or something in that vein.
>

For arrays vs. objects, it's not just about data size. There are plenty of
situations where a JSON array is superior to an object (e.g. duplicate
column names). Lines of JSON arrays of strings are pretty much CSV with JSON
escaping rules and a pair of wrapping brackets. It's common for tabular
data in node.js environments as you don't need a separate CSV parser.

Each one has its place and a default of the row_to_json(...) representation
of the row still makes sense. But if the user has the option of outputting
a single json/jsonb field for each row without an object or array wrapper,
then it's possible to support all of these use cases as the user can
explicitly pick whatever envelope makes sense:

-- Lines of JSON arrays:
COPY (SELECT json_build_array('test-' || a, b) FROM generate_series(1, 3)
a, generate_series(5,6) b) TO STDOUT WITH (FORMAT JSON,
SOME_OPTION_TO_DISABLE_ENVELOPE);
["test-1", 5]
["test-2", 5]
["test-3", 5]
["test-1", 6]
["test-2", 6]
["test-3", 6]

-- Lines of JSON strings:
COPY (SELECT to_json('test-' || x) FROM generate_series(1, 5) x) TO STDOUT
WITH (FORMAT JSON, SOME_OPTION_TO_DISABLE_ENVELOPE);
"test-1"
"test-2"
"test-3"
"test-4"
"test-5"

I'm not sure how I feel about the behavior being automatic if it's a single
top level json / jsonb field rather than requiring the explicit option.
It's probably what a user would want but it also feels odd to change the
output wrapper automatically based on the fields in the response. If it is
automatic and the user wants the additional envelope, the option always
exists to wrap it further in another: json_build_object('some_field',
my_field_i_want_wrapped)

The duplicate field names would be a good test case too. I haven't gone
through this patch but I'm guessing it doesn't filter out duplicates so the
behavior would match up with row_to_json(...), i.e. duplicates are
preserved:

=> SELECT row_to_json(t.*) FROM (SELECT 1 AS a, 2 AS a) t;
  row_to_json
---------------
 {"a":1,"a":2}

If so, that's a good test case to add as however that's handled should be
deterministic.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 11:28, Sehrope Sarkuni wrote:

Big +1 to this overall feature.


cool!

Regarding the defaults for the output, I think JSON lines (rather than a 
JSON array of objects) would be preferred. It's more natural to combine 
them and generate that type of data on the fly rather than forcing 
aggregation into a single object.


So that is +2 (Sehrope and me) for the status quo (JSON lines), and +2 
(Andrew and Davin) for defaulting to json arrays. Anyone else want to 
weigh in on that issue?


Couple more features / use cases come to mind as well. Even if they're 
not part of a first round of this feature I think it'd be helpful to 
document them now as it might give some ideas for what does make that 
first cut:


1. Outputting a top level JSON object without the additional column 
keys. IIUC, the top level keys are always the column names. A common use 
case would be a single json/jsonb column that is already formatted 
exactly as the user would like for output. Rather than enveloping it in 
an object with a dedicated key, it would be nice to be able to output it 
directly. This would allow non-object results to be outputted as well 
(e.g., lines of JSON arrays, numbers, or strings). Due to how JSON is 
structured, I think this would play nice with the JSON lines vs. array 
concept.


COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM 
generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, 
SOME_OPTION_TO_NOT_ENVELOPE)

{"foo":1}
{"foo":2}
{"foo":3}


Your example does not match what you describe, or do I misunderstand? I 
thought your goal was to eliminate the repeated "foo" from each row...


2. An option to ignore null fields so they are excluded from the output. 
This would not be a default but would allow shrinking the total size of 
the output data in many situations. This would be recursive to allow 
nested objects to be shrunk down (not just the top level). This might be 
worthwhile as a standalone JSON function though handling it during 
output would be more efficient as it'd only be read once.


COPY (SELECT json_build_object('foo', CASE WHEN x > 1 THEN x END) FROM 
generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON, 
SOME_OPTION_TO_NOT_ENVELOPE, JSON_SKIP_NULLS)

{}
{"foo":2}
{"foo":3}


clear enough I think

3. Reverse of #2 when copying data in to allow defaulting missing fields 
to NULL.


good to record the ask, but applies to a different feature (COPY FROM 
instead of COPY TO).


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Andrew Dunstan



On 2023-12-06 We 15:20, Tom Lane wrote:

Joe Conway  writes:

I'll see if I can add some caching to composite_to_json(), but based on
the relative data size it does not sound like there is much performance
left on the table to go after, no?

If Nathan's perf results hold up elsewhere, it seems like some
micro-optimization around the text-pushing (appendStringInfoString)
might be more useful than caching.  The 7% spent in cache lookups
could be worth going after later, but it's not the top of the list.

The output size difference does say that maybe we should pay some
attention to the nearby request to not always label every field.
Perhaps there should be an option for each row to transform to
a JSON array rather than an object?





I doubt it. People who want this are likely to want pretty much what 
this patch is providing, not something they would have to transform in 
order to get it. If they want space-efficient data they won't really be 
wanting JSON. Maybe they want Protocol Buffers or something in that vein.


I see there's a nearby proposal to make this area pluggable at 




cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Tom Lane
Joe Conway  writes:
> I'll see if I can add some caching to composite_to_json(), but based on 
> the relative data size it does not sound like there is much performance 
> left on the table to go after, no?

If Nathan's perf results hold up elsewhere, it seems like some
micro-optimization around the text-pushing (appendStringInfoString)
might be more useful than caching.  The 7% spent in cache lookups
could be worth going after later, but it's not the top of the list.

The output size difference does say that maybe we should pay some
attention to the nearby request to not always label every field.
Perhaps there should be an option for each row to transform to
a JSON array rather than an object?

regards, tom lane




Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 11:44, Nathan Bossart wrote:

On Wed, Dec 06, 2023 at 10:33:49AM -0600, Nathan Bossart wrote:

(format csv)
Time: 12295.480 ms (00:12.295)
Time: 12311.059 ms (00:12.311)
Time: 12305.469 ms (00:12.305)

(format json)
Time: 24568.621 ms (00:24.569)
Time: 23756.234 ms (00:23.756)
Time: 24265.730 ms (00:24.266)


I should also note that the json output is 85% larger than the csv output.


I'll see if I can add some caching to composite_to_json(), but based on 
the relative data size it does not sound like there is much performance 
left on the table to go after, no?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 13:59, Daniel Verite wrote:

Andrew Dunstan wrote:

IMNSHO, we should produce either a single JSON 
document (the ARRAY case) or a series of JSON documents, one per row 
(the LINES case).


"COPY Operations" in the doc says:

" The backend sends a CopyOutResponse message to the frontend, followed
by zero or more CopyData messages (always one per row), followed by
CopyDone".

In the ARRAY case, the first messages with the copyjsontest
regression test look like this (tshark output):

PostgreSQL
 Type: CopyOut response
 Length: 13
 Format: Text (0)
 Columns: 3
Format: Text (0)
PostgreSQL
 Type: Copy data
 Length: 6
 Copy data: 5b0a
PostgreSQL
 Type: Copy data
 Length: 76
 Copy data:
207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…

The first Copy data message with contents "5b0a" does not qualify
as a row of data with 3 columns as advertised in the CopyOut
message. Isn't that a problem?



Is it a real problem, or just a bit of documentation change that I missed?

Anything receiving this and looking for a json array should know how to 
assemble the data correctly despite the extra CopyData messages.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Daniel Verite
Andrew Dunstan wrote:

> IMNSHO, we should produce either a single JSON 
> document (the ARRAY case) or a series of JSON documents, one per row 
> (the LINES case).

"COPY Operations" in the doc says:

" The backend sends a CopyOutResponse message to the frontend, followed
   by zero or more CopyData messages (always one per row), followed by
   CopyDone".

In the ARRAY case, the first messages with the copyjsontest
regression test look like this (tshark output):

PostgreSQL
Type: CopyOut response
Length: 13
Format: Text (0)
Columns: 3
Format: Text (0)
PostgreSQL
Type: Copy data
Length: 6
Copy data: 5b0a
PostgreSQL
Type: Copy data
Length: 76
Copy data:
207b226964223a312c226631223a226c696e652077697468205c2220696e2069743a2031…

The first Copy data message with contents "5b0a" does not qualify
as a row of data with 3 columns as advertised in the CopyOut
message. Isn't that a problem?

At least the json non-ARRAY case ("json lines") doesn't have
this issue, since every CopyData message corresponds effectively
to a row in the table.


[1] https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-COPY


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Nathan Bossart
On Wed, Dec 06, 2023 at 10:33:49AM -0600, Nathan Bossart wrote:
>   (format csv)
>   Time: 12295.480 ms (00:12.295)
>   Time: 12311.059 ms (00:12.311)
>   Time: 12305.469 ms (00:12.305)
> 
>   (format json)
>   Time: 24568.621 ms (00:24.569)
>   Time: 23756.234 ms (00:23.756)
>   Time: 24265.730 ms (00:24.266)

I should also note that the json output is 85% larger than the csv output.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Nathan Bossart
On Wed, Dec 06, 2023 at 11:28:59AM -0500, Tom Lane wrote:
> It might be acceptable to plan on improving the performance later,
> depending on just how bad it is now.

On 10M rows with 11 integers each, I'm seeing the following:

(format text)
Time: 10056.311 ms (00:10.056)
Time: 8789.331 ms (00:08.789)
Time: 8755.070 ms (00:08.755)

(format csv)
Time: 12295.480 ms (00:12.295)
Time: 12311.059 ms (00:12.311)
Time: 12305.469 ms (00:12.305)

(format json)
Time: 24568.621 ms (00:24.569)
Time: 23756.234 ms (00:23.756)
Time: 24265.730 ms (00:24.266)

'perf top' tends to look a bit like this:

  13.31%  postgres  [.] appendStringInfoString
   7.57%  postgres  [.] datum_to_json_internal
   6.82%  postgres  [.] SearchCatCache1
   5.35%  [kernel]  [k] intel_gpio_irq
   3.57%  postgres  [.] composite_to_json
   3.31%  postgres  [.] IsValidJsonNumber

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Tom Lane
Andrew Dunstan  writes:
> On 2023-12-06 We 10:44, Tom Lane wrote:
>> In particular, has anyone done any performance testing?
>> I'm concerned about that because composite_to_json() has
>> zero capability to cache any metadata across calls, meaning
>> there is going to be a large amount of duplicated work
>> per row.

> Yeah, that's hard to deal with, too, as it can be called recursively.

Right.  On the plus side, if we did improve this it would presumably
also benefit other callers of composite_to_json[b].

> OTOH I'd rather have a version of this that worked slowly than none at all.

It might be acceptable to plan on improving the performance later,
depending on just how bad it is now.

regards, tom lane




Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Sehrope Sarkuni
Big +1 to this overall feature.

This is something I've wanted for a long time as well. While it's possible
to use a COPY with text output for a trivial case, the double escaping
falls apart quickly for arbitrary data. It's really only usable when you
know exactly what you are querying and know it will not be a problem.

Regarding the defaults for the output, I think JSON lines (rather than a
JSON array of objects) would be preferred. It's more natural to combine
them and generate that type of data on the fly rather than forcing
aggregation into a single object.

Couple more features / use cases come to mind as well. Even if they're not
part of a first round of this feature I think it'd be helpful to document
them now as it might give some ideas for what does make that first cut:

1. Outputting a top level JSON object without the additional column keys.
IIUC, the top level keys are always the column names. A common use case
would be a single json/jsonb column that is already formatted exactly as
the user would like for output. Rather than enveloping it in an object with
a dedicated key, it would be nice to be able to output it directly. This
would allow non-object results to be outputted as well (e.g., lines of JSON
arrays, numbers, or strings). Due to how JSON is structured, I think this
would play nice with the JSON lines vs. array concept.

COPY (SELECT json_build_object('foo', x) AS i_am_ignored FROM
generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON,
SOME_OPTION_TO_NOT_ENVELOPE)
{"foo":1}
{"foo":2}
{"foo":3}

2. An option to ignore null fields so they are excluded from the output.
This would not be a default but would allow shrinking the total size of the
output data in many situations. This would be recursive to allow nested
objects to be shrunk down (not just the top level). This might be
worthwhile as a standalone JSON function though handling it during output
would be more efficient as it'd only be read once.

COPY (SELECT json_build_object('foo', CASE WHEN x > 1 THEN x END) FROM
generate_series(1, 3) x) TO STDOUT WITH (FORMAT JSON,
SOME_OPTION_TO_NOT_ENVELOPE, JSON_SKIP_NULLS)
{}
{"foo":2}
{"foo":3}

3. Reverse of #2 when copying data in to allow defaulting missing fields to
NULL.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Tom Lane
Joe Conway  writes:
> On 12/6/23 10:44, Tom Lane wrote:
>> In particular, has anyone done any performance testing?

> I will devise some kind of test and report back. I suppose something 
> with many rows and many narrow columns comparing time to COPY 
> text/csv/json modes would do the trick?

Yeah.  If it's at least in the same ballpark as the existing text/csv
formats then I'm okay with it.  I'm worried that it might be 10x worse,
in which case I think we'd need to do something.

regards, tom lane




Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Andrew Dunstan



On 2023-12-06 We 10:44, Tom Lane wrote:

Joe Conway  writes:

I believe this is ready to commit unless there are further comments or
objections.

I thought we were still mostly at proof-of-concept stage?

In particular, has anyone done any performance testing?
I'm concerned about that because composite_to_json() has
zero capability to cache any metadata across calls, meaning
there is going to be a large amount of duplicated work
per row.





Yeah, that's hard to deal with, too, as it can be called recursively.

OTOH I'd rather have a version of this that worked slowly than none at all.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 10:44, Tom Lane wrote:

Joe Conway  writes:
I believe this is ready to commit unless there are further comments or 
objections.


I thought we were still mostly at proof-of-concept stage?


The concept is narrowly scoped enough that I think we are homing in on 
the final patch.



In particular, has anyone done any performance testing?
I'm concerned about that because composite_to_json() has
zero capability to cache any metadata across calls, meaning
there is going to be a large amount of duplicated work
per row.


I will devise some kind of test and report back. I suppose something 
with many rows and many narrow columns comparing time to COPY 
text/csv/json modes would do the trick?
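
For example, something along these lines with libpq would do it; the
table name "bench" (standing in for, say, 10M rows of 11 integers built
with generate_series), the connection string, and the FORMAT json
spelling (which needs the proposed patch applied) are all assumptions of
the sketch:

8<---
#include <stdio.h>
#include <time.h>
#include <libpq-fe.h>

/* run one COPY ... TO STDOUT, drain and discard the rows, return seconds */
static double
time_copy(PGconn *conn, const char *sql)
{
	struct timespec t0, t1;
	PGresult   *res;
	char	   *buf;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	res = PQexec(conn, sql);
	if (PQresultStatus(res) != PGRES_COPY_OUT)
	{
		fprintf(stderr, "COPY failed: %s", PQerrorMessage(conn));
		PQclear(res);
		return -1;
	}
	PQclear(res);

	while (PQgetCopyData(conn, &buf, 0) > 0)
		PQfreemem(buf);
	PQclear(PQgetResult(conn));

	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");	/* placeholder */
	const char *tests[] = {
		"COPY bench TO STDOUT (FORMAT text)",
		"COPY bench TO STDOUT (FORMAT csv)",
		"COPY bench TO STDOUT (FORMAT json)",	/* needs the patch */
	};

	if (PQstatus(conn) != CONNECTION_OK)
		return 1;

	for (int i = 0; i < 3; i++)
		printf("%-40s %.3f s\n", tests[i], time_copy(conn, tests[i]));

	PQfinish(conn);
	return 0;
}
8<---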


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 10:32, Andrew Dunstan wrote:


On 2023-12-06 We 08:49, Joe Conway wrote:

On 12/6/23 07:36, Andrew Dunstan wrote:


On 2023-12-05 Tu 16:46, Joe Conway wrote:

On 12/5/23 16:20, Andrew Dunstan wrote:

On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of 
MSSQL server I think [1] which had the non-array with commas 
import and export style. It was not that tough to support and the 
code as written already does it, so why not?


That seems quite absurd, TBH. I know we've catered for some absurdity in
the CSV code (much of it down to me), so maybe we need to be liberal in
what we accept here too. IMNSHO, we should produce either a single JSON
document (the ARRAY case) or a series of JSON documents, one per row
(the LINES case).


So your preference would be to not allow the non-array-with-commas 
case but if/when we implement COPY FROM we would accept that format? 
As in Postel's law ("be conservative in what you do, be liberal in 
what you accept from others")?



Yes, I think so.


Awesome. The attached does it that way. I also ran pgindent.

I believe this is ready to commit unless there are further comments or 
objections.


Sorry to bikeshed a little more, I'm a bit late looking at this.

I suspect that most users will actually want the table as a single JSON
document, so it should probably be the default. In any case FORCE_ARRAY
as an option has a slightly wrong feel to it.


Sure, I can make that happen, although I figured that for the 
many-rows-scenario the single array size might be an issue for whatever 
you are importing into.



I'm having trouble coming up with a good name for the reverse of
that, off the top of my head.


Will think about it and propose something with the next patch revision.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Tom Lane
Joe Conway  writes:
> I believe this is ready to commit unless there are further comments or 
> objections.

I thought we were still mostly at proof-of-concept stage?

In particular, has anyone done any performance testing?
I'm concerned about that because composite_to_json() has
zero capability to cache any metadata across calls, meaning
there is going to be a large amount of duplicated work
per row.
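
For illustration, here is a rough sketch of the kind of caching that
would avoid it, modeled on what copyto.c already does for the text/csv
paths: look up each column's output function once at startup and reuse
the FmgrInfo for every row. The struct and function names below are
hypothetical, not from any patch in this thread; composite_to_json()
currently redoes the equivalent of this lookup for every field of every
row, which shows up as SearchCatCache1 time in the profiles posted
elsewhere in the thread.

8<---
#include "postgres.h"

#include "access/tupdesc.h"
#include "fmgr.h"
#include "utils/lsyscache.h"

/* hypothetical per-column cache, for illustration only */
typedef struct ColumnOutCache
{
	Oid			typoutput;		/* output function OID */
	bool		typisvarlena;
	FmgrInfo	finfo;			/* looked up once, reused for every row */
} ColumnOutCache;

static ColumnOutCache *
build_out_cache(TupleDesc tupdesc)
{
	ColumnOutCache *cache = palloc(tupdesc->natts * sizeof(ColumnOutCache));

	for (int i = 0; i < tupdesc->natts; i++)
	{
		Form_pg_attribute att = TupleDescAttr(tupdesc, i);

		getTypeOutputInfo(att->atttypid,
						  &cache[i].typoutput,
						  &cache[i].typisvarlena);
		fmgr_info(cache[i].typoutput, &cache[i].finfo);
	}
	return cache;
}
8<---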

regards, tom lane




Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Andrew Dunstan



On 2023-12-06 We 08:49, Joe Conway wrote:

On 12/6/23 07:36, Andrew Dunstan wrote:


On 2023-12-05 Tu 16:46, Joe Conway wrote:

On 12/5/23 16:20, Andrew Dunstan wrote:

On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of 
MSSQL server I think [1] which had the non-array with commas 
import and export style. It was not that tough to support and the 
code as written already does it, so why not?


That seems quite absurd, TBH. I know we've catered for some absurdity in
the CSV code (much of it down to me), so maybe we need to be liberal in
what we accept here too. IMNSHO, we should produce either a single JSON
document (the ARRAY case) or a series of JSON documents, one per row
(the LINES case).


So your preference would be to not allow the non-array-with-commas 
case but if/when we implement COPY FROM we would accept that format? 
As in Postel's law ("be conservative in what you do, be liberal in 
what you accept from others")?



Yes, I think so.


Awesome. The attached does it that way. I also ran pgindent.

I believe this is ready to commit unless there are further comments or 
objections.



Sorry to bikeshed a little more, I'm a bit late looking at this.

I suspect that most users will actually want the table as a single JSON 
document, so it should probably be the default. In any case FORCE_ARRAY 
as an option has a slightly wrong feel to it. I'm having trouble coming 
up with a good name for the reverse of that, off the top of my head.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Joe Conway

On 12/6/23 07:36, Andrew Dunstan wrote:


On 2023-12-05 Tu 16:46, Joe Conway wrote:

On 12/5/23 16:20, Andrew Dunstan wrote:

On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and 
export style. It was not that tough to support and the code as 
written already does it, so why not?


That seems quite absurd, TBH. I know we've catered for some absurdity in
the CSV code (much of it down to me), so maybe we need to be liberal in
what we accept here too. IMNSHO, we should produce either a single JSON
document (the ARRAY case) or a series of JSON documents, one per row
(the LINES case).


So your preference would be to not allow the non-array-with-commas 
case but if/when we implement COPY FROM we would accept that format? 
As in Postel's law ("be conservative in what you do, be liberal in 
what you accept from others")?



Yes, I think so.


Awesome. The attached does it that way. I also ran pgindent.

I believe this is ready to commit unless there are further comments or 
objections.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Add json format mode to COPY TO

Add json format mode support to COPY TO, which includes two output
variations: 1) "json lines" which is each row as a json object delimited
by newlines (the default); and 2) "json array" which is the same as #1,
but with the addition of a leading "[", trailing "]", and comma row
delimiters, to form a valid json array.

Early versions: helpful hints/reviews provided by Nathan Bossart,
Tom Lane, and Maciek Sakrejda. Final versions: reviewed by Andrew Dunstan
and Davin Shearer.

Requested-by: Davin Shearer
Author: Joe Conway
Reviewed-by: Andrew Dunstan, Davin Shearer 
Discussion: https://postgr.es/m/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..8915fb3 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,49 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 207,220 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 378,396 
   
  
 
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of square brackets as array decorations at the beginning
+   and end of output, and commas between the rows. It is allowed only in
+   COPY TO, and only when using
+   JSON format. The default is
+   false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..23b570f 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 419,424 
--- 419,425 
  	bool		format_specified = false;
  	bool		freeze_specified = false;
  	bool		header_specified = false;
+ 	bool		force_array_specified = false;
  	ListCell   *option;
  
  	/* Support external use for option sanity checking */
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 444,451 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
*** 540,545 
--- 543,555 
  defel->defname),
  		 parser_errposition(pstate, defel->location)));
  		}
+ 		else if (strcmp(defel->defname, "force_array") == 0)
+ 		{
+ 			if (force_array_specified)
+ errorConflictingDefElem(defel, pstate);
+ 			force_array_specified = true;
+ 			opts_out->force_array = defGetBoolean(defel);
+ 		}
  		else if (strcmp(defel->defname, "convert_selectively") == 0)
  		{
  			/*
*** ProcessCopyOptions(ParseState *pstate,
*** 598,603 
--- 608,625 
  (errcode(ERRCODE_SYNTAX_ERROR),
   errmsg("cannot specify DEFAULT in BINARY mode")));
  
+ 	if (opts_out->json_mode)
+ 	{
+ 		if (is_from)
+ 			ereport(ERROR,
+ 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 	 errmsg("cannot use JSON mode in 

Re: Emitting JSON to file using COPY TO

2023-12-06 Thread Andrew Dunstan



On 2023-12-05 Tu 16:46, Joe Conway wrote:

On 12/5/23 16:20, Andrew Dunstan wrote:

On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and 
export style. It was not that tough to support and the code as 
written already does it, so why not?


That seems quite absurd, TBH. I know we've catered for some absurdity in
the CSV code (much of it down to me), so maybe we need to be liberal in
what we accept here too. IMNSHO, we should produce either a single JSON
document (the ARRAY case) or a series of JSON documents, one per row
(the LINES case).



So your preference would be to not allow the non-array-with-commas 
case but if/when we implement COPY FROM we would accept that format? 
As in Postel's law ("be conservative in what you do, be liberal in 
what you accept from others")?



Yes, I think so.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Davin Shearer
> Am I understanding something incorrectly?

No, you've got it.  You already covered the concerns there.

> That seems quite absurd, TBH. I know we've catered for some absurdity in
> the CSV code (much of it down to me), so maybe we need to be liberal in
> what we accept here too. IMNSHO, we should produce either a single JSON
> document (the ARRAY case) or a series of JSON documents, one per row
> (the LINES case).

For what it's worth, I agree with Andrew on this.  I also agree with COPY
FROM allowing for potentially bogus commas at the end of non-arrays for
interop with other products, but to not do that in COPY TO (unless there is
some real compelling case to do so). Emitting bogus JSON (non-array with
commas) feels wrong, and it would be nice not to perpetuate it, if possible.

Thanks again for doing this.  If I can be of any help, let me know.
If/When this makes it into the production product, I'll be using this
feature for sure.

-Davin


Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 16:20, Andrew Dunstan wrote:

On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and 
export style. It was not that tough to support and the code as written 
already does it, so why not?


That seems quite absurd, TBH. I know we've catered for some absurdity in
the CSV code (much of it down to me), so maybe we need to be liberal in
what we accept here too. IMNSHO, we should produce either a single JSON
document (the ARRAY case) or a series of JSON documents, one per row
(the LINES case).



So your preference would be to not allow the non-array-with-commas case 
but if/when we implement COPY FROM we would accept that format? As in 
Postel's law ("be conservative in what you do, be liberal in what you 
accept from others")?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Andrew Dunstan



On 2023-12-05 Tu 16:09, Joe Conway wrote:

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and 
export style. It was not that tough to support and the code as written 
already does it, so why not?


[1] 
https://learn.microsoft.com/en-us/sql/relational-databases/json/remove-square-brackets-from-json-without-array-wrapper-option?view=sql-server-ver16#example-multiple-row-result





That seems quite absurd, TBH. I know we've catered for some absurdity in 
the CSV code (much of it down to me), so maybe we need to be liberal in 
what we accept here too. IMNSHO, we should produce either a single JSON 
document (the ARRAY case) or a series of JSON documents, one per row 
(the LINES case).



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 16:12, Andrew Dunstan wrote:


On 2023-12-05 Tu 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:


On 2023-12-05 Tu 14:50, Davin Shearer wrote:

Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
case where someone would want to use FORCE ARRAY without also using 
FORCE ROW DELIMITER.  I can, however, envision a use case where 
someone would want FORCE ROW DELIMITER without FORCE ARRAY, like 
maybe including into a larger array.  I definitely appreciate these 
options and the flexibility that they afford from a user perspective.


In the test output, will you also show the different variations with 
FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, 
false), (false, true), (true, true)}? Technically you've already 
shown me the (false, false) case as those are the defaults.





I don't understand the point of FORCE_ROW_DELIMITER at all. There is
only one legal delimiter of array items in JSON, and that's a comma.
There's no alternative and it's not optional. So in the array case you
MUST have commas and in any other case (e.g. LINES) I can't see why you
would have them.


The current patch already *does* imply row delimiters in the array 
case. It says so here:

8<---
+    
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of 
output.

+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and 
only

+   when using JSON format.
+   The default is false.
+  
8<---

and it does so here:
8<---
+ if (opts_out->force_array)
+ opts_out->force_row_delimiter = true;
8<---

and it shows that here:
8<---
+ copy copytest to stdout (format json, force_array);
+ [
+  {"style":"DOS","test":"abc\r\ndef","filler":1}
+ ,{"style":"Unix","test":"abc\ndef","filler":2}
+ ,{"style":"Mac","test":"abc\rdef","filler":3}
+ ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
+ ]
8<---

It also does not allow explicitly setting row delimiters false while 
force_array is true here:

8<---

+ if (opts_out->force_array &&
+ force_row_delimiter_specified &&
+ !opts_out->force_row_delimiter)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+  errmsg("cannot specify FORCE_ROW_DELIMITER 
false with FORCE_ARRAY true")));

8<---

Am I understanding something incorrectly?



But what's the point of having it if you're not using FORCE_ARRAY?



See the follow up email -- other databases support it so why not? It 
seems to be a thing...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Andrew Dunstan



On 2023-12-05 Tu 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:


On 2023-12-05 Tu 14:50, Davin Shearer wrote:

Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
case where someone would want to use FORCE ARRAY without also using 
FORCE ROW DELIMITER.  I can, however, envision a use case where 
someone would want FORCE ROW DELIMITER without FORCE ARRAY, like 
maybe including into a larger array.  I definitely appreciate these 
options and the flexibility that they afford from a user perspective.


In the test output, will you also show the different variations with 
FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, 
false), (false, true), (true, true)}? Technically you've already 
shown me the (false, false) case as those are the defaults.





I don't understand the point of FORCE_ROW_DELIMITER at all. There is
only one legal delimiter of array items in JSON, and that's a comma.
There's no alternative and it's not optional. So in the array case you
MUST have commas and in any other case (e.g. LINES) I can't see why you
would have them.


The current patch already *does* imply row delimiters in the array 
case. It says so here:

8<---
+    
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of 
output.

+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and 
only

+   when using JSON format.
+   The default is false.
+  
8<---

and it does so here:
8<---
+ if (opts_out->force_array)
+ opts_out->force_row_delimiter = true;
8<---

and it shows that here:
8<---
+ copy copytest to stdout (format json, force_array);
+ [
+  {"style":"DOS","test":"abc\r\ndef","filler":1}
+ ,{"style":"Unix","test":"abc\ndef","filler":2}
+ ,{"style":"Mac","test":"abc\rdef","filler":3}
+ ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
+ ]
8<---

It also does not allow explicitly setting row delimiters false while 
force_array is true here:

8<---

+ if (opts_out->force_array &&
+ force_row_delimiter_specified &&
+ !opts_out->force_row_delimiter)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+  errmsg("cannot specify FORCE_ROW_DELIMITER 
false with FORCE_ARRAY true")));

8<---

Am I understanding something incorrectly?



But what's the point of having it if you're not using FORCE_ARRAY?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 16:02, Joe Conway wrote:

On 12/5/23 15:55, Andrew Dunstan wrote:

and in any other case (e.g. LINES) I can't see why you
would have them.


Oh I didn't address this -- I saw examples in the interwebs of MSSQL 
server I think [1] which had the non-array with commas import and export 
style. It was not that tough to support and the code as written already 
does it, so why not?


[1] 
https://learn.microsoft.com/en-us/sql/relational-databases/json/remove-square-brackets-from-json-without-array-wrapper-option?view=sql-server-ver16#example-multiple-row-result



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 15:55, Andrew Dunstan wrote:


On 2023-12-05 Tu 14:50, Davin Shearer wrote:

Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
case where someone would want to use FORCE ARRAY without also using 
FORCE ROW DELIMITER.  I can, however, envision a use case where 
someone would want FORCE ROW DELIMITER without FORCE ARRAY, like maybe 
including into a larger array.  I definitely appreciate these options 
and the flexibility that they afford from a user perspective.


In the test output, will you also show the different variations with 
FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false), 
(false, true), (true, true)}?  Technically you've already shown me the 
(false, false) case as those are the defaults.





I don't understand the point of FORCE_ROW_DELIMITER at all. There is
only one legal delimiter of array items in JSON, and that's a comma.
There's no alternative and it's not optional. So in the array case you
MUST have commas and in any other case (e.g. LINES) I can't see why you
would have them.


The current patch already *does* imply row delimiters in the array case. 
It says so here:

8<---
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of 
output.

+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
8<---

and it does so here:
8<---
+ if (opts_out->force_array)
+ opts_out->force_row_delimiter = true;
8<---

and it shows that here:
8<---
+ copy copytest to stdout (format json, force_array);
+ [
+  {"style":"DOS","test":"abc\r\ndef","filler":1}
+ ,{"style":"Unix","test":"abc\ndef","filler":2}
+ ,{"style":"Mac","test":"abc\rdef","filler":3}
+ ,{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
+ ]
8<---

It also does not allow explicitly setting row delimiters false while 
force_array is true here:

8<---

+   if (opts_out->force_array &&
+   force_row_delimiter_specified &&
+   !opts_out->force_row_delimiter)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 	 errmsg("cannot specify FORCE_ROW_DELIMITER false with 
FORCE_ARRAY true")));

8<---

Am I understanding something incorrectly?

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Andrew Dunstan



On 2023-12-05 Tu 14:50, Davin Shearer wrote:

Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, 
we should also _imply_ FORCE ROW DELIMITER.  I can't envision a use 
case where someone would want to use FORCE ARRAY without also using 
FORCE ROW DELIMITER.  I can, however, envision a use case where 
someone would want FORCE ROW DELIMITER without FORCE ARRAY, like maybe 
including into a larger array.  I definitely appreciate these options 
and the flexibility that they afford from a user perspective.


In the test output, will you also show the different variations with 
FORCE ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false), 
(false, true), (true, true)}?  Technically you've already shown me the 
(false, false) case as those are the defaults.





I don't understand the point of FORCE_ROW_DELIMITER at all. There is 
only one legal delimiter of array items in JSON, and that's a comma. 
There's no alternative and it's not optional. So in the array case you 
MUST have commas and in any other case (e.g. LINES) I can't see why you 
would have them.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Davin Shearer
Hi Joe,

In reviewing the 005 patch, I think that when used with FORCE ARRAY, we
should also _imply_ FORCE ROW DELIMITER.  I can't envision a use case where
someone would want to use FORCE ARRAY without also using FORCE ROW
DELIMITER.  I can, however, envision a use case where someone would want
FORCE ROW DELIMITER without FORCE ARRAY, like maybe including into a larger
array.  I definitely appreciate these options and the flexibility that they
afford from a user perspective.

In the test output, will you also show the different variations with FORCE
ARRAY and FORCE ROW DELIMITER => {(false, false), (true, false), (false,
true), (true, true)}?  Technically you've already shown me the (false,
false) case as those are the defaults.

Thanks!


Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/5/23 12:43, Davin Shearer wrote:

Joe, those test cases look great and the outputs are the same as `jq`.




Forward slash escaping is optional, so not escaping them in Postgres is 
okay. The important thing is that the software _reading_ JSON 
interprets both '\/' and '/' as '/'.


Thanks for the review and info. I modified the existing regression test 
thus:


8<--
create temp table copyjsontest (
id bigserial,
f1 text,
f2 timestamptz);

insert into copyjsontest
  select g.i,
 CASE WHEN g.i % 2 = 0 THEN
   'line with '' in it: ' || g.i::text
 ELSE
   'line with " in it: ' || g.i::text
 END,
 'Mon Feb 10 17:32:01 1997 PST'
  from generate_series(1,5) as g(i);

insert into copyjsontest (f1) values
(E'aaa\"bbb'::text),
(E'aaa\\bbb'::text),
(E'aaa\/bbb'::text),
(E'aaa\bbbb'::text),
(E'aaa\fbbb'::text),
(E'aaa\nbbb'::text),
(E'aaa\rbbb'::text),
(E'aaa\tbbb'::text);
copy copyjsontest to stdout json;
{"id":1,"f1":"line with \" in it: 1","f2":"1997-02-10T20:32:01-05:00"}
{"id":2,"f1":"line with ' in it: 2","f2":"1997-02-10T20:32:01-05:00"}
{"id":3,"f1":"line with \" in it: 3","f2":"1997-02-10T20:32:01-05:00"}
{"id":4,"f1":"line with ' in it: 4","f2":"1997-02-10T20:32:01-05:00"}
{"id":5,"f1":"line with \" in it: 5","f2":"1997-02-10T20:32:01-05:00"}
{"id":1,"f1":"aaa\"bbb","f2":null}
{"id":2,"f1":"aaa\\bbb","f2":null}
{"id":3,"f1":"aaa/bbb","f2":null}
{"id":4,"f1":"aaa\","f2":null}
{"id":5,"f1":"aaa\fbbb","f2":null}
{"id":6,"f1":"aaa\nbbb","f2":null}
{"id":7,"f1":"aaa\rbbb","f2":null}
{"id":8,"f1":"aaa\tbbb","f2":null}
8<--
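
As a quick machine check of json-lines output generally (a sketch, not part
of the patch; it assumes a regular, non-temp table t in a scratch database
named test, and jq on the PATH -- jq exits nonzero if any row fails to parse):

8<--
psql test -Atc "copy t to stdout (format json)" | jq -c . >/dev/null \
  && echo "every row parses as JSON"
8<--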

I think the code, documentation, and tests are in pretty good shape at 
this point. Latest version attached.


Any other comments or complaints out there?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Add json format mode to COPY TO

Add json format mode support to COPY TO, which includes three output
variations: 1) "json lines", in which each row is emitted as a json object
delimited by newlines (the default); 2) "json lines" with comma delimiters
between the json objects; and 3) "json array", which is the same as #2 but
with the addition of a leading "[" and a trailing "]" to form a valid json
array.

Early versions: helpful hints/reviews provided by Nathan Bossart,
Tom Lane, and Maciek Sakrejda. Final versions: reviewed by Andrew Dunstan
and Davin Shearer.

Requested-by: Davin Shearer
Author: Joe Conway
Reviewed-by: Andrew Dunstan, Davin Shearer 
Discussion: https://postgr.es/m/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5


diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..af8777b 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,50 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
+ FORCE_ROW_DELIMITER [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 208,221 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 379,410 
   
  
 
+ 
+
+ FORCE_ROW_DELIMITER
+ 
+  
+   Force output of commas as row delimiters, in addition to the usual
+   end of line characters. This option is allowed only in
+   COPY TO, and only when using
+   JSON format.
+   The default is false.
+  
+ 
+
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of output.
+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..0236a9e 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 419,424 
--- 419,426 
  	bool		format_specified = false;
  	bool		freeze_specified = false;
  	bool		header_specified = false;
+ 	bool		force_row_delimiter_specified = false;
+ 	bool		force_array_specified = false;
  	ListCell   *option;
  
  	/* Support external use for option sanity checking */
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 445,452 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if 

Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Davin Shearer
Thanks for the wayback machine link Andrew.  I read it, understood it, and
will comply.

Joe, those test cases look great and the outputs are the same as `jq`.

As for forward slashes being escaped, I found this:
https://stackoverflow.com/questions/1580647/json-why-are-forward-slashes-escaped
.

Forward slash escaping is optional, so not escaping them in Postgres is
okay.   The important thing is that the software _reading_ JSON interprets
both '\/' and '/' as '/'.


Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Joe Conway

On 12/4/23 21:54, Joe Conway wrote:

On 12/4/23 17:55, Davin Shearer wrote:

There are however a few characters that need to be escaped



 1. |"|(double quote)
 2. |\|(backslash)
 3. |/|(forward slash)
 4. |\b|(backspace)
 5. |\f|(form feed)
 6. |\n|(new line)
 7. |\r|(carriage return)
 8. |\t|(horizontal tab)

These characters should be represented in the test cases to see how the 
escaping behaves and to ensure that the escaping is done properly per 
JSON requirements.


I can look at adding these as test cases.

So I did a quick check:
8<--
with t(f1) as
(
  values
(E'aaa\"bbb'::text),
(E'aaa\\bbb'::text),
(E'aaa\/bbb'::text),
(E'aaa\bbbb'::text),
(E'aaa\fbbb'::text),
(E'aaa\nbbb'::text),
(E'aaa\rbbb'::text),
(E'aaa\tbbb'::text)
)
select
  length(t.f1),
  t.f1,
  row_to_json(t)
from t;
 length |     f1     |    row_to_json
--------+------------+-------------------
      7 | aaa"bbb    | {"f1":"aaa\"bbb"}
      7 | aaa\bbb    | {"f1":"aaa\\bbb"}
      7 | aaa/bbb    | {"f1":"aaa/bbb"}
      7 | aaa\x08bbb | {"f1":"aaa\bbbb"}
      7 | aaa\x0Cbbb | {"f1":"aaa\fbbb"}
      7 | aaa       +| {"f1":"aaa\nbbb"}
        | bbb        |
      7 | aaa\rbbb   | {"f1":"aaa\rbbb"}
      7 | aaa bbb    | {"f1":"aaa\tbbb"}
(8 rows)

8<--

This is all independent of my patch for COPY TO. If I am reading that 
correctly, everything matches Davin's table *except* the forward slash 
("/"). I defer to the experts on the thread to debate that...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-05 Thread Andrew Dunstan



On 2023-12-04 Mo 17:55, Davin Shearer wrote:
Sorry about the top posting / top quoting... the link you sent me 
gives me a 404.  I'm not exactly sure what top quoting / posting means 
and Googling those terms wasn't helpful for me, but I've removed the 
quoting that my mail client is automatically "helpfully" adding to my 
emails.  I mean no offense.



Hmm. Luckily the Wayback Machine has a copy: 



Maybe I'll put a copy in the developer wiki.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Joe Conway

On 12/4/23 17:55, Davin Shearer wrote:
Sorry about the top posting / top quoting... the link you sent me gives 
me a 404.  I'm not exactly sure what top quoting / posting means and 
Googling those terms wasn't helpful for me, but I've removed the quoting 
that my mail client is automatically "helpfully" adding to my emails.  I 
mean no offense.


No offense taken. But it is worthwhile to conform to the very long 
established norms of the mailing lists on which you participate. See:


  https://en.wikipedia.org/wiki/Posting_style

I would describe the Postgres list style (based on that link) as

   "inline replying, in which the different parts of the reply follow
the relevant parts of the original post...[with]...trimming of the
original text"


There are however a few characters that need to be escaped



 1. |"|(double quote)
 2. |\|(backslash)
 3. |/|(forward slash)
 4. |\b|(backspace)
 5. |\f|(form feed)
 6. |\n|(new line)
 7. |\r|(carriage return)
 8. |\t|(horizontal tab)

These characters should be represented in the test cases to see how the 
escaping behaves and to ensure that the escaping is done properly per 
JSON requirements.


I can look at adding these as test cases. The latest version of the 
patch (attached) includes some of that already. For reference, the tests 
so far include this:


8<---
test=# select * from copytest;
  style  |   test   | filler
---------+----------+--------
 DOS     | abc\r   +|      1
         | def      |
 Unix    | abc     +|      2
         | def      |
 Mac     | abc\rdef |      3
 esc\ape | a\r\\r\ +|      4
         | \nb      |
(4 rows)

test=# copy copytest to stdout (format json);
{"style":"DOS","test":"abc\r\ndef","filler":1}
{"style":"Unix","test":"abc\ndef","filler":2}
{"style":"Mac","test":"abc\rdef","filler":3}
{"style":"esc\\ape","test":"a\\r\\\r\\\n\\nb","filler":4}
8<---

At this point "COPY TO" should be sending exactly the unaltered output 
of the postgres JSON processing functions.
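
For comparison, a quick sketch (not part of the patch tests): calling the
row-to-json machinery directly yields the identical escaping:

8<---
test=# select row_to_json(copytest) from copytest where style = 'DOS';
                  row_to_json
------------------------------------------------
 {"style":"DOS","test":"abc\r\ndef","filler":1}
(1 row)
8<---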


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..af8777b 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,50 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
+ FORCE_ROW_DELIMITER [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 208,221 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 379,410 
   
  
 
+ 
+
+ FORCE_ROW_DELIMITER
+ 
+  
+   Force output of commas as row delimiters, in addition to the usual
+   end of line characters. This option is allowed only in
+   COPY TO, and only when using
+   JSON format.
+   The default is false.
+  
+ 
+
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of output.
+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..0236a9e 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 419,424 
--- 419,426 
  	bool		format_specified = false;
  	bool		freeze_specified = false;
  	bool		header_specified = false;
+ 	bool		force_row_delimiter_specified = false;
+ 	bool		force_array_specified = false;
  	ListCell   *option;
  
  	/* Support external use for option sanity checking */
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 445,452 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
*** 540,545 
--- 544,563 
  defel->defname),
  		 parser_errposition(pstate, defel->location)));
  		}
+ 		else if (strcmp(defel->defname, "force_row_delimiter") == 0)
+ 		{
+ 			if (force_row_delimiter_specified)
+ errorConflictingDefElem(defel, pstate);
+ 			force_row_delimiter_specified = true;
+ 			

Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Davin Shearer
Sorry about the top posting / top quoting... the link you sent me gives me
a 404.  I'm not exactly sure what top quoting / posting means and Googling
those terms wasn't helpful for me, but I've removed the quoting that my
mail client is automatically "helpfully" adding to my emails.  I mean no
offense.

Okay, digging in more...

If the value contains text that has BOMs [footnote 1] in it, it must be
preserved (the database doesn't need to interpret them or do anything
special with them - just store it and fetch it).  There are however a few
characters that need to be escaped (per
https://www.w3docs.com/snippets/java/how-should-i-escape-strings-in-json.html)
so that the JSON format isn't broken.  They are:


   1. " (double quote)
   2. \ (backslash)
   3. / (forward slash)
   4. \b (backspace)
   5. \f (form feed)
   6. \n (new line)
   7. \r (carriage return)
   8. \t (horizontal tab)

These characters should be represented in the test cases to see how the
escaping behaves and to ensure that the escaping is done properly per JSON
requirements.  Forward slash comes as a bit of a surprise to me, but `jq`
handles it either way:

➜ echo '{"key": "this / is a forward slash"}' | jq .
{
  "key": "this / is a forward slash"
}
➜ echo '{"key": "this \/ is a forward slash"}' | jq .
{
  "key": "this / is a forward slash"
}

Hope it helps, and thank you!

1. I don't disagree that BOMs shouldn't be used for UTF-8, but I'm also
processing UTF-16{BE,LE} and UTF-32{BE,LE} (as well as other textual
formats that are neither ASCII nor Unicode).  I don't have the luxury of
changing the data that is given.


Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Andrew Dunstan


On 2023-12-04 Mo 13:37, Davin Shearer wrote:

Looking great!

For testing, in addition to the quotes, include DOS and Unix EOL, \ 
and /, Byte Order Markers, and multibyte characters like UTF-8.


Essentially anything considered textual is fair game to be a value.



Joe already asked you to avoid top-posting on PostgreSQL lists. See the 
posting-style link he gave for an explanation.


We don't process BOMs elsewhere, and probably should not here either. 
They are in fact neither required nor recommended for use with UTF8 
data, AIUI. See a recent discussion on this list on that topic: 




cheers


andrew


--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Davin Shearer
Looking great!

For testing, in addition to the quotes, include DOS and Unix EOL, \ and /,
Byte Order Markers, and multibyte characters like UTF-8.

Essentially anything considered textual is fair game to be a value.

On Mon, Dec 4, 2023, 10:46 Joe Conway  wrote:

> On 12/4/23 09:25, Andrew Dunstan wrote:
> >
> > On 2023-12-04 Mo 08:37, Joe Conway wrote:
> >> On 12/4/23 07:41, Andrew Dunstan wrote:
> >>>
> >>> On 2023-12-03 Su 20:14, Joe Conway wrote:
>  (please don't top quote on the Postgres lists)
> 
>  On 12/3/23 17:38, Davin Shearer wrote:
> > " being quoted as \\" breaks the JSON. It needs to be \".  This has
> > been my whole problem with COPY TO for JSON.
> >
> > Please validate that the output is in proper format with correct
> > quoting for special characters. I use `jq` on the command line to
> > validate and format the output.
> 
>  I just hooked existing "row-to-json machinery" up to the "COPY TO"
>  statement. If the output is wrong (just for this use case?),
>  that would be a missing feature (or possibly a bug?).
> 
>  Davin -- how did you work around the issue with the way the built in
>  functions output JSON?
> 
>  Andrew -- comments/thoughts?
> >>>
> >>> I meant to mention this when I was making comments yesterday.
> >>>
> >>> The patch should not be using CopyAttributeOutText - it will try to
> >>> escape characters such as \, which produces the effect complained of
> >>> here, or else we need to change its setup so we have a way to inhibit
> >>> that escaping.
> >>
> >>
> >> Interesting.
> >>
> >> I am surprised this has never been raised as a problem with COPY TO
> >> before.
> >>
> >> Should the JSON output, as produced by composite_to_json(), be sent
> >> as-is with no escaping at all? If yes, is JSON somehow unique in this
> >> regard?
> >
> >
> > Text mode output is in such a form that it can be read back in using
> > text mode input. There's nothing special about JSON in this respect -
> > any text field will be escaped too. But output suitable for text mode
> > input is not what you're trying to produce here; you're trying to
> > produce valid JSON.
> >
> > So, yes, the result of composite_to_json, which is already suitably
> > escaped, should not be further escaped in this case.
>
> Gotcha.
>
> This patch version uses CopySendData() instead and includes
> documentation changes. Still lacks regression tests.
>
> Hopefully this looks better. Any other particular strings I ought to
> test with?
>
> 8<--
> test=# copy (select * from foo limit 4) to stdout (format json,
> force_array true);
> [
>   {"id":1,"f1":"line with \" in it:
> 1","f2":"2023-12-03T12:26:41.596053-05:00"}
> ,{"id":2,"f1":"line with ' in it:
> 2","f2":"2023-12-03T12:26:41.596173-05:00"}
> ,{"id":3,"f1":"line with \" in it:
> 3","f2":"2023-12-03T12:26:41.596179-05:00"}
> ,{"id":4,"f1":"line with ' in it:
> 4","f2":"2023-12-03T12:26:41.596182-05:00"}
> ]
> 8<--
>
> --
> Joe Conway
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>


Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Joe Conway

On 12/4/23 09:25, Andrew Dunstan wrote:


On 2023-12-04 Mo 08:37, Joe Conway wrote:

On 12/4/23 07:41, Andrew Dunstan wrote:


On 2023-12-03 Su 20:14, Joe Conway wrote:

(please don't top quote on the Postgres lists)

On 12/3/23 17:38, Davin Shearer wrote:
" being quoted as \\" breaks the JSON. It needs to be \".  This has 
been my whole problem with COPY TO for JSON.


Please validate that the output is in proper format with correct 
quoting for special characters. I use `jq` on the command line to 
validate and format the output.


I just hooked existing "row-to-json machinery" up to the "COPY TO" 
statement. If the output is wrong (just for this use case?), 
that would be a missing feature (or possibly a bug?).


Davin -- how did you work around the issue with the way the built in 
functions output JSON?


Andrew -- comments/thoughts?


I meant to mention this when I was making comments yesterday.

The patch should not be using CopyAttributeOutText - it will try to
escape characters such as \, which produces the effect complained of
here, or else we need to change its setup so we have a way to inhibit
that escaping.



Interesting.

I am surprised this has never been raised as a problem with COPY TO 
before.


Should the JSON output, as produced by composite_to_json(), be sent 
as-is with no escaping at all? If yes, is JSON somehow unique in this 
regard?



Text mode output is in such a form that it can be read back in using
text mode input. There's nothing special about JSON in this respect -
any text field will be escaped too. But output suitable for text mode
input is not what you're trying to produce here; you're trying to
produce valid JSON.

So, yes, the result of composite_to_json, which is already suitably
escaped, should not be further escaped in this case.


Gotcha.

This patch version uses CopySendData() instead and includes 
documentation changes. Still lacks regression tests.


Hopefully this looks better. Any other particular strings I ought to 
test with?


8<--
test=# copy (select * from foo limit 4) to stdout (format json, 
force_array true);

[
 {"id":1,"f1":"line with \" in it: 
1","f2":"2023-12-03T12:26:41.596053-05:00"}
,{"id":2,"f1":"line with ' in it: 
2","f2":"2023-12-03T12:26:41.596173-05:00"}
,{"id":3,"f1":"line with \" in it: 
3","f2":"2023-12-03T12:26:41.596179-05:00"}
,{"id":4,"f1":"line with ' in it: 
4","f2":"2023-12-03T12:26:41.596182-05:00"}

]
8<--

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69..af8777b 100644
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*** COPY { ta
*** 43,48 
--- 43,50 
  FORCE_QUOTE { ( column_name [, ...] ) | * }
  FORCE_NOT_NULL { ( column_name [, ...] ) | * }
  FORCE_NULL { ( column_name [, ...] ) | * }
+ FORCE_ARRAY [ boolean ]
+ FORCE_ROW_DELIMITER [ boolean ]
  ENCODING 'encoding_name'
  
   
*** COPY { ta
*** 206,214 
--- 208,221 
Selects the data format to be read or written:
text,
csv (Comma Separated Values),
+   json (JavaScript Object Notation),
or binary.
The default is text.
   
+  
+   The json option is allowed only in
+   COPY TO.
+  
  
 
  
*** COPY { ta
*** 372,377 
--- 379,410 
   
  
 
+ 
+
+ FORCE_ROW_DELIMITER
+ 
+  
+   Force output of commas as row delimiters, in addition to the usual
+   end of line characters. This option is allowed only in
+   COPY TO, and only when using
+   JSON format.
+   The default is false.
+  
+ 
+
+ 
+
+ FORCE_ARRAY
+ 
+  
+   Force output of array decorations at the beginning and end of output.
+   This option implies the FORCE_ROW_DELIMITER
+   option. It is allowed only in COPY TO, and only
+   when using JSON format.
+   The default is false.
+  
+ 
+
  
 
  ENCODING
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..46ec34f 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*** ProcessCopyOptions(ParseState *pstate,
*** 443,448 
--- 443,450 
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*** ProcessCopyOptions(ParseState *pstate,
*** 540,545 
--- 542,559 
  defel->defname),
  		 parser_errposition(pstate, defel->location)));
  		}
+ 		else if (strcmp(defel->defname, "force_row_delimiter") == 0)
+ 		{
+ 			if (opts_out->force_row_delimiter)
+ errorConflictingDefElem(defel, pstate);
+ 			

Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Andrew Dunstan



On 2023-12-04 Mo 08:37, Joe Conway wrote:

On 12/4/23 07:41, Andrew Dunstan wrote:


On 2023-12-03 Su 20:14, Joe Conway wrote:

(please don't top quote on the Postgres lists)

On 12/3/23 17:38, Davin Shearer wrote:
" being quoted as \\" breaks the JSON. It needs to be \".  This has 
been my whole problem with COPY TO for JSON.


Please validate that the output is in proper format with correct 
quoting for special characters. I use `jq` on the command line to 
validate and format the output.


I just hooked existing "row-to-json machinery" up to the "COPY TO" 
statement. If the output is wrong (just for this use case?), 
that would be a missing feature (or possibly a bug?).


Davin -- how did you work around the issue with the way the built in 
functions output JSON?


Andrew -- comments/thoughts?


I meant to mention this when I was making comments yesterday.

The patch should not be using CopyAttributeOutText - it will try to
escape characters such as \, which produces the effect complained of
here, or else we need to change its setup so we have a way to inhibit
that escaping.



Interesting.

I am surprised this has never been raised as a problem with COPY TO 
before.


Should the JSON output, as produced by composite_to_json(), be sent 
as-is with no escaping at all? If yes, is JSON somehow unique in this 
regard?



Text mode output is in such a form that it can be read back in using 
text mode input. There's nothing special about JSON in this respect - 
any text field will be escaped too. But output suitable for text mode 
input is not what you're trying to produce here; you're trying to 
produce valid JSON.


So, yes, the result of composite_to_json, which is already suitably 
escaped, should not be further escaped in this case.
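
To illustrate with a sketch (the same pre-escaped JSON string, round-tripped
both ways):

8<---
-- the JSON functions emit correctly escaped output:
test=# select row_to_json(t) from (select 'a"b'::text as f1) t;
  row_to_json
---------------
 {"f1":"a\"b"}
(1 row)

-- but pushing that string through text-mode COPY re-escapes the backslash:
test=# copy (select '{"f1":"a\"b"}'::text) to stdout;
{"f1":"a\\"b"}
8<---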



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Joe Conway

On 12/4/23 07:41, Andrew Dunstan wrote:


On 2023-12-03 Su 20:14, Joe Conway wrote:

(please don't top quote on the Postgres lists)

On 12/3/23 17:38, Davin Shearer wrote:
" being quoted as \\" breaks the JSON. It needs to be \".  This has 
been my whole problem with COPY TO for JSON.


Please validate that the output is in proper format with correct 
quoting for special characters. I use `jq` on the command line to 
validate and format the output.


I just hooked existing "row-to-json machinery" up to the "COPY TO" 
statement. If the output is wrong (just for this use case?), that 
would be a missing feature (or possibly a bug?).


Davin -- how did you work around the issue with the way the built in 
functions output JSON?


Andrew -- comments/thoughts?


I meant to mention this when I was making comments yesterday.

The patch should not be using CopyAttributeOutText - it will try to
escape characters such as \, which produces the effect complained of
here, or else we need to change its setup so we have a way to inhibit
that escaping.



Interesting.

I am surprised this has never been raised as a problem with COPY TO before.

Should the JSON output, as produced by composite_to_json(), be sent 
as-is with no escaping at all? If yes, is JSON somehow unique in this 
regard?


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-04 Thread Andrew Dunstan



On 2023-12-03 Su 20:14, Joe Conway wrote:

(please don't top quote on the Postgres lists)

On 12/3/23 17:38, Davin Shearer wrote:
" being quoted as \\" breaks the JSON. It needs to be \".  This has 
been my whole problem with COPY TO for JSON.


Please validate that the output is in proper format with correct 
quoting for special characters. I use `jq` on the command line to 
validate and format the output.


I just hooked existing "row-to-json machinery" up to the "COPY TO" 
statement. If the output is wrong (just for this use case?), that 
would be a missing feature (or possibly a bug?).


Davin -- how did you work around the issue with the way the built in 
functions output JSON?


Andrew -- comments/thoughts?




I meant to mention this when I was making comments yesterday.

The patch should not be using CopyAttributeOutText - it will try to 
escape characters such as \, which produces the effect complained of 
here, or else we need to change its setup so we have a way to inhibit 
that escaping.



cheers


andrew







--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Davin Shearer
I worked around it by using select json_agg(t)... and redirecting it to
file via psql on the command line. COPY TO was working until we ran into
broken JSON and discovered the double quoting issue due to some values
containing " in them.


Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

(please don't top quote on the Postgres lists)

On 12/3/23 17:38, Davin Shearer wrote:
" being quoted as \\" breaks the JSON. It needs to be \".  This has been 
my whole problem with COPY TO for JSON.


Please validate that the output is in proper format with correct quoting 
for special characters. I use `jq` on the command line to validate and 
format the output.


I just hooked existing "row-to-json machinery" up to the "COPY TO" 
statement. If the output is wrong (just for this use case?), that 
would be a missing feature (or possibly a bug?).


Davin -- how did you work around the issue with the way the built in 
functions output JSON?


Andrew -- comments/thoughts?

Joe


On Sun, Dec 3, 2023, 10:51 Joe Conway wrote:


On 12/3/23 10:31, Davin Shearer wrote:
 > Please be sure to include single and double quotes in the test
values
 > since that was the original problem (double quoting in COPY TO
breaking
 > the JSON syntax).

test=# copy (select * from foo limit 4) to stdout (format json);
{"id":2456092,"f1":"line with ' in it:
2456092","f2":"2023-12-03T10:44:40.9712-05:00"}
{"id":2456093,"f1":"line with \\" in it:
2456093","f2":"2023-12-03T10:44:40.971221-05:00"}
{"id":2456094,"f1":"line with ' in it:
2456094","f2":"2023-12-03T10:44:40.971225-05:00"}
{"id":2456095,"f1":"line with \\" in it:
2456095","f2":"2023-12-03T10:44:40.971228-05:00"}

-- 
Joe Conway

PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com 



--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Davin Shearer
" being quoted as \\" breaks the JSON. It needs to be \".  This has been my
whole problem with COPY TO for JSON.

Please validate that the output is in proper format with correct quoting
for special characters. I use `jq` on the command line to validate and
format the output.

On Sun, Dec 3, 2023, 10:51 Joe Conway  wrote:

> On 12/3/23 10:31, Davin Shearer wrote:
> > Please be sure to include single and double quotes in the test values
> > since that was the original problem (double quoting in COPY TO breaking
> > the JSON syntax).
>
> test=# copy (select * from foo limit 4) to stdout (format json);
> {"id":2456092,"f1":"line with ' in it:
> 2456092","f2":"2023-12-03T10:44:40.9712-05:00"}
> {"id":2456093,"f1":"line with \\" in it:
> 2456093","f2":"2023-12-03T10:44:40.971221-05:00"}
> {"id":2456094,"f1":"line with ' in it:
> 2456094","f2":"2023-12-03T10:44:40.971225-05:00"}
> {"id":2456095,"f1":"line with \\" in it:
> 2456095","f2":"2023-12-03T10:44:40.971228-05:00"}
>
> --
> Joe Conway
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>
>


Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 14:52, Andrew Dunstan wrote:


On 2023-12-03 Su 14:24, Joe Conway wrote:

On 12/3/23 11:03, Joe Conway wrote:

On 12/3/23 10:10, Andrew Dunstan wrote:
I realize this is just a POC, but I'd prefer to see composite_to_json()
not exposed. You could use the already public datum_to_json() instead,
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
arguments.


Ok, thanks, will do


Just FYI, this change does lose some performance in my not massively 
scientific A/B/A test:


8<---



8<---

That is somewhere in the 3% range.


I assume it's because datum_to_json() constructs a text value from which
you then need to extract the cstring, whereas composite_to_json() just
gives you back the stringinfo. I guess that's a good enough reason to go
with exposing composite_to_json().


Yeah, that was why I went that route in the first place. If you are good 
with it I will go back to that. The code is a bit simpler too.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Andrew Dunstan



On 2023-12-03 Su 14:24, Joe Conway wrote:

On 12/3/23 11:03, Joe Conway wrote:

On 12/3/23 10:10, Andrew Dunstan wrote:
I realize this is just a POC, but I'd prefer to see composite_to_json()
not exposed. You could use the already public datum_to_json() instead,
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
arguments.


Ok, thanks, will do


Just FYI, this change does lose some performance in my not massively 
scientific A/B/A test:


8<---
-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 37196.898 ms (00:37.197)
Time: 37408.161 ms (00:37.408)
Time: 38393.309 ms (00:38.393)
Time: 36855.438 ms (00:36.855)
Time: 37806.280 ms (00:37.806)

Avg = 37532

-- original patch
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 37426.207 ms (00:37.426)
Time: 36068.187 ms (00:36.068)
Time: 38285.252 ms (00:38.285)
Time: 36971.042 ms (00:36.971)
Time: 35690.822 ms (00:35.691)

Avg = 36888

-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 39083.467 ms (00:39.083)
Time: 37249.326 ms (00:37.249)
Time: 38529.721 ms (00:38.530)
Time: 38704.920 ms (00:38.705)
Time: 39001.326 ms (00:39.001)

Avg = 38513
8<---

That is somewhere in the 3% range.



I assume it's because datum_to_json() constructs a text value from which 
you then need to extract the cstring, whereas composite_to_json() just 
gives you back the stringinfo. I guess that's a good enough reason to go 
with exposing composite_to_json().



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Andrew Dunstan



On 2023-12-03 Su 12:11, Joe Conway wrote:

On 12/3/23 11:03, Joe Conway wrote:

From your earlier post, regarding constructing the aggregate -- not
extensive testing but one data point:
8<--
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 36353.153 ms (00:36.353)
test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
COPY 1
Time: 46835.238 ms (00:46.835)
8<--


Also if the table is large enough, the aggregate method is not even 
feasible whereas the COPY TO method works:

8<--
test=# select count(*) from foo;
  count
--
 2000
(1 row)

test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 1073741822 bytes by 1 
more bytes.


test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 2000
8<--



None of this is surprising. As I mentioned, limitations with json_agg() 
are why I support the idea of this patch.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 11:03, Joe Conway wrote:

On 12/3/23 10:10, Andrew Dunstan wrote:

I realize this is just a POC, but I'd prefer to see composite_to_json()
not exposed. You could use the already public datum_to_json() instead,
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
arguments.


Ok, thanks, will do


Just FYI, this change does lose some performance in my not massively 
scientific A/B/A test:


8<---
-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 37196.898 ms (00:37.197)
Time: 37408.161 ms (00:37.408)
Time: 38393.309 ms (00:38.393)
Time: 36855.438 ms (00:36.855)
Time: 37806.280 ms (00:37.806)

Avg = 37532

-- original patch
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 37426.207 ms (00:37.426)
Time: 36068.187 ms (00:36.068)
Time: 38285.252 ms (00:38.285)
Time: 36971.042 ms (00:36.971)
Time: 35690.822 ms (00:35.691)

Avg = 36888

-- with datum_to_json()
test=# \timing
Timing is on.
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 39083.467 ms (00:39.083)
Time: 37249.326 ms (00:37.249)
Time: 38529.721 ms (00:38.530)
Time: 38704.920 ms (00:38.705)
Time: 39001.326 ms (00:39.001)

Avg = 38513
8<---

That is somewhere in the 3% range.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 11:03, Joe Conway wrote:

From your earlier post, regarding constructing the aggregate -- not
extensive testing but one data point:
8<--
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 36353.153 ms (00:36.353)
test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
COPY 1
Time: 46835.238 ms (00:46.835)
8<--


Also if the table is large enough, the aggregate method is not even 
feasible whereas the COPY TO method works:

8<--
test=# select count(*) from foo;
  count
--
 2000
(1 row)

test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 1073741822 bytes by 1 
more bytes.


test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 2000
8<--
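
If a single array document is really needed at that scale, one possible
sketch is to emit json lines and assemble the array on the client, e.g.
with jq (note jq -s buffers on the client side, so it has its own limits):

8<--
psql test -Atc "copy foo to stdout (format json)" | jq -s . > /tmp/foo.json
8<--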

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 10:10, Andrew Dunstan wrote:


On 2023-12-01 Fr 14:28, Joe Conway wrote:

On 11/29/23 10:32, Davin Shearer wrote:

Thanks for the responses everyone.

I worked around the issue using the `psql -tc` method as Filip 
described.


I think it would be great to support writing JSON using COPY TO at 
some point so I can emit JSON to files using a PostgreSQL function 
directly.


-Davin

On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák wrote:


    This would be a very special case for COPY. It applies only to a single
    column of JSON values. The original problem can be solved with psql
    --tuples-only as David wrote earlier.


    $ psql -tc 'select json_agg(row_to_json(t))
               from (select * from public.tbl_json_test) t;'

   [{"id":1,"t_test":"here's a \"string\""}]


    Special-casing any encoding/escaping scheme leads to bugs and harder
    parsing.


(moved to hackers)

I did a quick PoC patch (attached) -- if there is interest and no hard 
objections I would like to get it up to speed for the January commitfest.


Currently the patch lacks documentation and regression test support.

Questions:
--
1. Is supporting JSON array format sufficient, or does it need to 
support some other options? How flexible does the support scheme need 
to be?


2. This only supports COPY TO and we would undoubtedly want to support 
COPY FROM for JSON as well, but is that required from the start?


Thanks for any feedback.


I realize this is just a POC, but I'd prefer to see composite_to_json()
not exposed. You could use the already public datum_to_json() instead,
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
arguments.


Ok, thanks, will do


I think JSON array format is sufficient.


The other formats make sense from a completeness standpoint (versus 
other databases) and the latest patch already includes them, so I still 
lean toward supporting all three formats.
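
For reference, the three shapes under discussion (a sketch):

8<--
-- json lines (the default):
{"a":1}
{"a":2}
-- json lines plus comma row delimiters (force_row_delimiter):
{"a":1}
,{"a":2}
-- json array (force_array, which implies force_row_delimiter):
[
 {"a":1}
,{"a":2}
]
8<--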



I can see both sides of the COPY FROM argument, but I think insisting on
that makes this less doable for release 17. On balance I would stick to
COPY TO for now.


WFM.

From your earlier post, regarding constructing the aggregate -- not 
extensive testing but one data point:

8<--
test=# copy foo to '/tmp/buf' (format json, force_array);
COPY 1000
Time: 36353.153 ms (00:36.353)
test=# copy (select json_agg(foo) from foo) to '/tmp/buf';
COPY 1
Time: 46835.238 ms (00:46.835)
8<--


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/3/23 10:31, Davin Shearer wrote:
Please be sure to include single and double quotes in the test values 
since that was the original problem (double quoting in COPY TO breaking 
the JSON syntax).


test=# copy (select * from foo limit 4) to stdout (format json);
{"id":2456092,"f1":"line with ' in it: 
2456092","f2":"2023-12-03T10:44:40.9712-05:00"}
{"id":2456093,"f1":"line with \\" in it: 
2456093","f2":"2023-12-03T10:44:40.971221-05:00"}
{"id":2456094,"f1":"line with ' in it: 
2456094","f2":"2023-12-03T10:44:40.971225-05:00"}
{"id":2456095,"f1":"line with \\" in it: 
2456095","f2":"2023-12-03T10:44:40.971228-05:00"}


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Davin Shearer
Please be sure to include single and double quotes in the test values since
that was the original problem (double quoting in COPY TO breaking the JSON
syntax).

On Sun, Dec 3, 2023, 10:11 Andrew Dunstan  wrote:

>
> On 2023-12-01 Fr 14:28, Joe Conway wrote:
> > On 11/29/23 10:32, Davin Shearer wrote:
> >> Thanks for the responses everyone.
> >>
> >> I worked around the issue using the `psql -tc` method as Filip
> >> described.
> >>
> >> I think it would be great to support writing JSON using COPY TO at
> >> some point so I can emit JSON to files using a PostgreSQL function
> >> directly.
> >>
> >> -Davin
> >>
> >> On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák wrote:
> >>
> >> This would be a very special case for COPY. It applies only to a
> >> single
> >> column of JSON values. The original problem can be solved with psql
> >> --tuples-only as David wrote earlier.
> >>
> >>
> >> $ psql -tc 'select json_agg(row_to_json(t))
> >>from (select * from public.tbl_json_test) t;'
> >>
> >>[{"id":1,"t_test":"here's a \"string\""}]
> >>
> >>
> >> Special-casing any encoding/escaping scheme leads to bugs and harder
> >> parsing.
> >
> > (moved to hackers)
> >
> > I did a quick PoC patch (attached) -- if there is interest and no hard
> > objections I would like to get it up to speed for the January commitfest.
> >
> > Currently the patch lacks documentation and regression test support.
> >
> > Questions:
> > --
> > 1. Is supporting JSON array format sufficient, or does it need to
> > support some other options? How flexible does the support scheme need
> > to be?
> >
> > 2. This only supports COPY TO and we would undoubtedly want to support
> > COPY FROM for JSON as well, but is that required from the start?
> >
> > Thanks for any feedback.
>
>
> I realize this is just a POC, but I'd prefer to see composite_to_json()
> not exposed. You could use the already public datum_to_json() instead,
> passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third
> arguments.
>
> I think JSON array format is sufficient.
>
> I can see both sides of the COPY FROM argument, but I think insisting on
> that makes this less doable for release 17. On balance I would stick to
> COPY TO for now.
>
>
> cheers
>
>
> andrew
>
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
>


Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Andrew Dunstan



On 2023-12-01 Fr 14:28, Joe Conway wrote:

On 11/29/23 10:32, Davin Shearer wrote:

Thanks for the responses everyone.

I worked around the issue using the `psql -tc` method as Filip 
described.


I think it would be great to support writing JSON using COPY TO at 
some point so I can emit JSON to files using a PostgreSQL function 
directly.


-Davin

On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák  wrote:


    This would be a very special case for COPY. It applies only to a
    single column of JSON values. The original problem can be solved
    with psql --tuples-only as David wrote earlier.


    $ psql -tc 'select json_agg(row_to_json(t))
               from (select * from public.tbl_json_test) t;'

   [{"id":1,"t_test":"here's a \"string\""}]


    Special-casing any encoding/escaping scheme leads to bugs and harder
    parsing.


(moved to hackers)

I did a quick PoC patch (attached) -- if there is interest and no hard 
objections I would like to get it up to speed for the January commitfest.


Currently the patch lacks documentation and regression test support.

Questions:
--
1. Is supporting JSON array format sufficient, or does it need to 
support some other options? How flexible does the support scheme need 
to be?


2. This only supports COPY TO and we would undoubtedly want to support 
COPY FROM for JSON as well, but is that required from the start?


Thanks for any feedback.



I realize this is just a POC, but I'd prefer to see composite_to_json() 
not exposed. You could use the already public datum_to_json() instead, 
passing JSONTYPE_COMPOSITE and F_RECORD_OUT as the second and third 
arguments.


I think JSON array format is sufficient.

I can see both sides of the COPY FROM argument, but I think insisting on 
that makes this less doable for release 17. On balance I would stick to 
COPY TO for now.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Emitting JSON to file using COPY TO

2023-12-03 Thread Joe Conway

On 12/2/23 17:37, Joe Conway wrote:

On 12/2/23 16:53, Nathan Bossart wrote:

On Sat, Dec 02, 2023 at 10:11:20AM -0500, Tom Lane wrote:

So if you are writing a production that might need to match
FORMAT followed by JSON, you need to match FORMAT_LA too.


Thanks for the pointer.  That does seem to be the culprit.

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index d631ac89a9..048494dd07 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -3490,6 +3490,10 @@ copy_generic_opt_elem:
  {
  $$ = makeDefElem($1, $2, @1);
  }
+| FORMAT_LA copy_generic_opt_arg
+{
+$$ = makeDefElem("format", $2, @1);
+}
  ;
  
  copy_generic_opt_arg:



Yep -- I concluded the same. Thanks Tom!
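
Concretely, the statement that failed to parse before this production 
was added is the basic invocation itself, because the lexer folds 
FORMAT followed by JSON into the single FORMAT_LA token:

8<---
-- Raised a syntax error before the grammar fix above:
copy foo to stdout (format json);
8<---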


The attached implements the above repair, and adds support for optional 
array decoration and/or comma row delimiters when not emitting an array.


This covers the three variations of json import/export formats that I 
have found after light searching (SQL Server and DuckDB).


Still lacks documentation, tests, and COPY FROM support, but here is 
what it looks like in a nutshell:


8<---
create table foo(id int8, f1 text, f2 timestamptz);
insert into foo
  select g.i,
 'line: ' || g.i::text,
 clock_timestamp()
  from generate_series(1,4) as g(i);

copy foo to stdout (format json);
{"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}

copy foo to stdout (format json, force_array);
[
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}
]

copy foo to stdout (format json, force_row_delimiter);
 {"id":1,"f1":"line: 1","f2":"2023-12-01T12:58:16.776863-05:00"}
,{"id":2,"f1":"line: 2","f2":"2023-12-01T12:58:16.777084-05:00"}
,{"id":3,"f1":"line: 3","f2":"2023-12-01T12:58:16.777096-05:00"}
,{"id":4,"f1":"line: 4","f2":"2023-12-01T12:58:16.777103-05:00"}

copy foo to stdout (force_array);
ERROR:  COPY FORCE_ARRAY requires JSON mode

copy foo to stdout (force_row_delimiter);
ERROR:  COPY FORCE_ROW_DELIMITER requires JSON mode
8<---
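
Until COPY FROM support exists, one way to round-trip the FORCE_ARRAY 
output is to read the file back as a single json value and unnest it, 
e.g. (assuming the array was written to /tmp/foo.json and the role is 
allowed to call pg_read_file()):

8<---
create table foo2 (like foo);
insert into foo2
  select (rec->>'id')::int8,
         rec->>'f1',
         (rec->>'f2')::timestamptz
  from json_array_elements(
         pg_read_file('/tmp/foo.json')::json) as t(rec);
8<---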


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfad47b..1f9ac31 100644
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*************** ProcessCopyOptions(ParseState *pstate,
*** 443,448 ****
--- 443,450 ----
   /* default format */ ;
  			else if (strcmp(fmt, "csv") == 0)
  opts_out->csv_mode = true;
+ 			else if (strcmp(fmt, "json") == 0)
+ opts_out->json_mode = true;
  			else if (strcmp(fmt, "binary") == 0)
  opts_out->binary = true;
  			else
*************** ProcessCopyOptions(ParseState *pstate,
*** 540,545 ****
--- 542,559 ----
  defel->defname),
  		 parser_errposition(pstate, defel->location)));
  		}
+ 		else if (strcmp(defel->defname, "force_row_delimiter") == 0)
+ 		{
+ 			if (opts_out->force_row_delimiter)
+ errorConflictingDefElem(defel, pstate);
+ 			opts_out->force_row_delimiter = true;
+ 		}
+ 		else if (strcmp(defel->defname, "force_array") == 0)
+ 		{
+ 			if (opts_out->force_array)
+ errorConflictingDefElem(defel, pstate);
+ 			opts_out->force_array = true;
+ 		}
  		else if (strcmp(defel->defname, "convert_selectively") == 0)
  		{
  			/*
*************** ProcessCopyOptions(ParseState *pstate,
*** 598,603 ****
--- 612,631 ----
  (errcode(ERRCODE_SYNTAX_ERROR),
   errmsg("cannot specify DEFAULT in BINARY mode")));
  
+ 	if (opts_out->json_mode)
+ 	{
+ 		if (opts_out->force_array)
+ 			opts_out->force_row_delimiter = true;
+ 	}
+ 	else if (opts_out->force_array)
+ 		ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+  errmsg("COPY FORCE_ARRAY requires JSON mode")));
+ 	else if (opts_out->force_row_delimiter)
+ 		ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+  errmsg("COPY FORCE_ROW_DELIMITER requires JSON mode")));
+ 
  	/* Set defaults for omitted options */
  	if (!opts_out->delim)
  		opts_out->delim = opts_out->csv_mode ? "," : "\t";
*************** ProcessCopyOptions(ParseState *pstate,
*** 667,672 ****
--- 695,705 ----
  (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
   errmsg("cannot specify HEADER in BINARY mode")));
  
+ 	if (opts_out->json_mode && opts_out->header_line)
+ 		ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+  errmsg("cannot specify HEADER in JSON mode")));
