Re: Small fix on COPY ON_ERROR document

2024-02-04 Thread Yugo NAGATA
On Sat, 3 Feb 2024 01:51:56 +0200
Alexander Korotkov  wrote:

> On Fri, Feb 2, 2024 at 10:07 PM David G. Johnston
>  wrote:
> > On Thu, Feb 1, 2024 at 11:59 PM Yugo NAGATA  wrote:
> >>
> >>
> >> I attached a updated patch including fixes you pointed out above.
> >>
> >
> > Removed "which"; changed "occupying" to "occupy"
> > Removed on of the two "amounts"
> > Changed "unacceptable to the input function" to just "converting" as that 
> > is what the average reader is more likely to be thinking.
> > The rest was good.
> 
> Thanks to everybody in the thread.
> Pushed.

Thanks!
> 
> --
> Regards,
> Alexander Korotkov


-- 
Yugo NAGATA 




Re: Small fix on COPY ON_ERROR document

2024-02-02 Thread Alexander Korotkov
On Fri, Feb 2, 2024 at 10:07 PM David G. Johnston
 wrote:
> On Thu, Feb 1, 2024 at 11:59 PM Yugo NAGATA  wrote:
>>
>>
>> I attached a updated patch including fixes you pointed out above.
>>
>
> Removed "which"; changed "occupying" to "occupy"
> Removed on of the two "amounts"
> Changed "unacceptable to the input function" to just "converting" as that is 
> what the average reader is more likely to be thinking.
> The rest was good.

Thanks to everybody in the thread.
Pushed.

--
Regards,
Alexander Korotkov




Re: Small fix on COPY ON_ERROR document

2024-02-02 Thread David G. Johnston
On Thu, Feb 1, 2024 at 11:59 PM Yugo NAGATA  wrote:

>
> I attached a updated patch including fixes you pointed out above.
>
>
Removed "which"; changed "occupying" to "occupy"
Removed on of the two "amounts"
Changed "unacceptable to the input function" to just "converting" as that
is what the average reader is more likely to be thinking.
The rest was good.

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 3c2feaa11a..55764fc1f2 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -385,8 +385,8 @@ COPY { table_name [ ( ON_ERROR
 
  
-  Specifies which how to behave when encountering an error due to
column values
-  unacceptable to the input function of each attribute's data type.
+  Specifies how to behave when encountering an error converting a
column's
+  input value into its data type.
   An error_action value of
   stop means fail the command, while
   ignore means discard the input row and continue
with the next one.
@@ -589,7 +589,7 @@ COPY count
 The COPY FROM command physically inserts input rows
 into the table as it progresses.  If the command fails, these rows are
 left in a deleted state; these rows will not be visible, but still
-occupying disk space. This might amount to a considerable amount of
+occupy disk space. This might amount to considerable
 wasted disk space if the failure happened well into a large copy
 operation. VACUUM should be used to recover the
 wasted space.

David J.


Re: Small fix on COPY ON_ERROR document

2024-02-01 Thread Yugo NAGATA
On Fri, 02 Feb 2024 11:29:41 +0900
torikoshia  wrote:

> On 2024-02-01 15:16, Yugo NAGATA wrote:
> > On Mon, 29 Jan 2024 15:47:25 +0900
> > Yugo NAGATA  wrote:
> > 
> >> On Sun, 28 Jan 2024 19:14:58 -0700
> >> "David G. Johnston"  wrote:
> >> 
> >> > > Also, I think "invalid input syntax" is a bit ambiguous. For example,
> >> > > COPY FROM raises an error when the number of input column does not 
> >> > > match
> >> > > to the table schema, but this error is not ignored by ON_ERROR while
> >> > > this seems to fall into the category of "invalid input syntax".
> >> >
> >> >
> >> >
> >> > It is literally the error text that appears if one were not to ignore it.
> >> > It isn’t a category of errors.  But I’m open to ideas here.  But being
> >> > explicit with what on actually sees in the system seemed preferable to
> >> > inventing new classification terms not otherwise used.
> >> 
> >> Thank you for explanation! I understood the words was from the error 
> >> messages
> >> that users actually see. However, as Torikoshi-san said in [1], errors 
> >> other
> >> than valid input syntax (e.g. range error) can be also ignored, 
> >> therefore it
> >> would be better to describe to be ignored errors more specifically.
> >> 
> >> [1] 
> >> https://www.postgresql.org/message-id/7f1457497fa3bf9dfe486f162d1c8ec6%40oss.nttdata.com
> >> 
> >> >
> >> > >
> >> > > So, keeping consistency with the existing description, we can say:
> >> > >
> >> > > "Specifies which how to behave when encountering an error due to
> >> > >  column values unacceptable to the input function of each attribute's
> >> > >  data type."
> >> >
> >> >
> >> > Yeah, I was considering something along those lines as an option as well.
> >> > But I’d rather add that wording to the glossary.
> >> 
> >> Although I am still be not convinced if we have to introduce the words
> >> "soft error" to the documentation, I don't care it if there are no 
> >> other
> >> opposite opinions.
> > 
> > Attached is a updated patch v3, which is a version that uses the above
> > wording instead of "soft error".
> > 
> >> >
> >> > > Currently, ON_ERROR doesn't support other soft errors, so it can 
> >> > > explain
> >> > > it more simply without introducing the new concept, "soft error" to 
> >> > > users.
> >> > >
> >> > >
> >> > Good point.  Seems we should define what user-facing errors are ignored
> >> > anywhere in the system and if we aren’t consistently leveraging these in
> >> > all areas/commands make the necessary qualifications in those specific
> >> > places.
> >> >
> >> 
> >> > > I think "left in a deleted state" is also unclear for users because 
> >> > > this
> >> > > explains the internal state but not how looks from user's view.How 
> >> > > about
> >> > > leaving the explanation "These rows will not be visible or accessible" 
> >> > > in
> >> > > the existing statement?
> >> > >
> >> >
> >> > Just visible then, I don’t like an “or” there and as tuples at least they
> >> > are accessible to the system, in vacuum especially.  But I expected the
> >> > user to understand “as if you deleted it” as their operational concept 
> >> > more
> >> > readily than visible.  I think this will be read by people who haven’t 
> >> > read
> >> > MVCC to fully understand what visible means but know enough to run vacuum
> >> > to clean up updated and deleted data as a rule.
> >> 
> >> Ok, I agree we can omit "or accessible". How do you like the 
> >> followings?
> >> Still redundant?
> >> 
> >>  "If the command fails, these rows are left in a deleted state;
> >>   these rows will not be visible, but they still occupy disk space. "
> > 
> > Also, the above statement is used in the patch.
> 
> Thanks for updating the patch!
> 
> I like your description which doesn't use the word soft error.

Thank you for your comments!

> 
> Here are minor comments:
> 
> > +  ignore means discard the input row and 
> > continue with the next one.
> > +  The default is stop
> 
> Is "." required at the end of the line?
>
> >  An NOTICE level context message containing the 
> > ignored row count is
> 
> Should 'An' be 'A'?
> 
> Also, I wasn't sure the necessity of 'context'.
> It might be possible to just say "A NOTICE message containing the 
> ignored row count.."
> considering below existing descriptions:
> 
>doc/src/sgml/pltcl.sgml: a NOTICE message each 
> time a supported command is
>doc/src/sgml/pltcl.sgml- executed:
> 
>doc/src/sgml/plpgsql.sgml: This example trigger simply raises a 
> NOTICE message
>doc/src/sgml/plpgsql.sgml- each time a supported command is 
> executed.

I attached a updated patch including fixes you pointed out above.

Regards,
Yugo Nagata

> -- 
> Regards,
> 
> --
> Atsushi Torikoshi
> NTT DATA Group Corporation


-- 
Yugo NAGATA 
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 21a5c4a052..3c2feaa11a 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -90,6 +90,13 @@ COPY { 

Re: Small fix on COPY ON_ERROR document

2024-02-01 Thread torikoshia

On 2024-02-01 15:16, Yugo NAGATA wrote:

On Mon, 29 Jan 2024 15:47:25 +0900
Yugo NAGATA  wrote:


On Sun, 28 Jan 2024 19:14:58 -0700
"David G. Johnston"  wrote:

> > Also, I think "invalid input syntax" is a bit ambiguous. For example,
> > COPY FROM raises an error when the number of input column does not match
> > to the table schema, but this error is not ignored by ON_ERROR while
> > this seems to fall into the category of "invalid input syntax".
>
>
>
> It is literally the error text that appears if one were not to ignore it.
> It isn’t a category of errors.  But I’m open to ideas here.  But being
> explicit with what on actually sees in the system seemed preferable to
> inventing new classification terms not otherwise used.

Thank you for explanation! I understood the words was from the error 
messages
that users actually see. However, as Torikoshi-san said in [1], errors 
other
than valid input syntax (e.g. range error) can be also ignored, 
therefore it

would be better to describe to be ignored errors more specifically.

[1] 
https://www.postgresql.org/message-id/7f1457497fa3bf9dfe486f162d1c8ec6%40oss.nttdata.com


>
> >
> > So, keeping consistency with the existing description, we can say:
> >
> > "Specifies which how to behave when encountering an error due to
> >  column values unacceptable to the input function of each attribute's
> >  data type."
>
>
> Yeah, I was considering something along those lines as an option as well.
> But I’d rather add that wording to the glossary.

Although I am still be not convinced if we have to introduce the words
"soft error" to the documentation, I don't care it if there are no 
other

opposite opinions.


Attached is a updated patch v3, which is a version that uses the above
wording instead of "soft error".


>
> > Currently, ON_ERROR doesn't support other soft errors, so it can explain
> > it more simply without introducing the new concept, "soft error" to users.
> >
> >
> Good point.  Seems we should define what user-facing errors are ignored
> anywhere in the system and if we aren’t consistently leveraging these in
> all areas/commands make the necessary qualifications in those specific
> places.
>

> > I think "left in a deleted state" is also unclear for users because this
> > explains the internal state but not how looks from user's view.How about
> > leaving the explanation "These rows will not be visible or accessible" in
> > the existing statement?
> >
>
> Just visible then, I don’t like an “or” there and as tuples at least they
> are accessible to the system, in vacuum especially.  But I expected the
> user to understand “as if you deleted it” as their operational concept more
> readily than visible.  I think this will be read by people who haven’t read
> MVCC to fully understand what visible means but know enough to run vacuum
> to clean up updated and deleted data as a rule.

Ok, I agree we can omit "or accessible". How do you like the 
followings?

Still redundant?

 "If the command fails, these rows are left in a deleted state;
  these rows will not be visible, but they still occupy disk space. "


Also, the above statement is used in the patch.


Thanks for updating the patch!

I like your description which doesn't use the word soft error.


Here are minor comments:

+  ignore means discard the input row and 
continue with the next one.

+  The default is stop


Is "." required at the end of the line?

 An NOTICE level context message containing the 
ignored row count is


Should 'An' be 'A'?

Also, I wasn't sure the necessity of 'context'.
It might be possible to just say "A NOTICE message containing the 
ignored row count.."

considering below existing descriptions:

  doc/src/sgml/pltcl.sgml: a NOTICE message each 
time a supported command is

  doc/src/sgml/pltcl.sgml- executed:

  doc/src/sgml/plpgsql.sgml: This example trigger simply raises a 
NOTICE message
  doc/src/sgml/plpgsql.sgml- each time a supported command is 
executed.


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Small fix on COPY ON_ERROR document

2024-01-31 Thread Yugo NAGATA
On Mon, 29 Jan 2024 15:47:25 +0900
Yugo NAGATA  wrote:

> On Sun, 28 Jan 2024 19:14:58 -0700
> "David G. Johnston"  wrote:
> 
> > > Also, I think "invalid input syntax" is a bit ambiguous. For example,
> > > COPY FROM raises an error when the number of input column does not match
> > > to the table schema, but this error is not ignored by ON_ERROR while
> > > this seems to fall into the category of "invalid input syntax".
> > 
> > 
> > 
> > It is literally the error text that appears if one were not to ignore it.
> > It isn’t a category of errors.  But I’m open to ideas here.  But being
> > explicit with what on actually sees in the system seemed preferable to
> > inventing new classification terms not otherwise used.
> 
> Thank you for explanation! I understood the words was from the error messages
> that users actually see. However, as Torikoshi-san said in [1], errors other
> than valid input syntax (e.g. range error) can be also ignored, therefore it
> would be better to describe to be ignored errors more specifically.
> 
> [1] 
> https://www.postgresql.org/message-id/7f1457497fa3bf9dfe486f162d1c8ec6%40oss.nttdata.com
> 
> > 
> > >
> > > So, keeping consistency with the existing description, we can say:
> > >
> > > "Specifies which how to behave when encountering an error due to
> > >  column values unacceptable to the input function of each attribute's
> > >  data type."
> > 
> > 
> > Yeah, I was considering something along those lines as an option as well.
> > But I’d rather add that wording to the glossary.
> 
> Although I am still be not convinced if we have to introduce the words
> "soft error" to the documentation, I don't care it if there are no other
> opposite opinions. 

Attached is a updated patch v3, which is a version that uses the above
wording instead of "soft error".

> > 
> > > Currently, ON_ERROR doesn't support other soft errors, so it can explain
> > > it more simply without introducing the new concept, "soft error" to users.
> > >
> > >
> > Good point.  Seems we should define what user-facing errors are ignored
> > anywhere in the system and if we aren’t consistently leveraging these in
> > all areas/commands make the necessary qualifications in those specific
> > places.
> > 
> 
> > > I think "left in a deleted state" is also unclear for users because this
> > > explains the internal state but not how looks from user's view.How about
> > > leaving the explanation "These rows will not be visible or accessible" in
> > > the existing statement?
> > >
> > 
> > Just visible then, I don’t like an “or” there and as tuples at least they
> > are accessible to the system, in vacuum especially.  But I expected the
> > user to understand “as if you deleted it” as their operational concept more
> > readily than visible.  I think this will be read by people who haven’t read
> > MVCC to fully understand what visible means but know enough to run vacuum
> > to clean up updated and deleted data as a rule.
> 
> Ok, I agree we can omit "or accessible". How do you like the followings?
> Still redundant?
> 
>  "If the command fails, these rows are left in a deleted state;
>   these rows will not be visible, but they still occupy disk space. "

Also, the above statement is used in the patch.

Regards,
Yugo Nagata

 
> Regards,
> Yugo Nagata
> 
> -- 
> Yugo NAGATA 
> 
> 


-- 
Yugo NAGATA 
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 21a5c4a052..10cfc3f0ad 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -90,6 +90,13 @@ COPY { table_name [ ( pg_stat_progress_copy view. See
  for details.
   
+
+  
+By default, COPY will fail if it encounters an error
+during processing. For use cases where a best-effort attempt at loading
+the entire file is desired, the ON_ERROR clause can
+be used to specify some other behavior.
+  
  
 
  
@@ -378,17 +385,20 @@ COPY { table_name [ ( ON_ERROR
 
  
-  Specifies which 
-  error_action to perform when there is malformed data in the input.
-  Currently, only stop (default) and ignore
-  values are supported.
-  If the stop value is specified,
-  COPY stops operation at the first error.
-  If the ignore value is specified,
-  COPY skips malformed data and continues copying data.
-  The option is allowed only in COPY FROM.
-  Only stop value is allowed when
-  using binary format.
+  Specifies which how to behave when encountering an error due to column values
+  unacceptable to the input function of each attribute's data type.
+  An error_action value of
+  stop means fail the command, while
+  ignore means discard the input row and continue with the next one.
+  The default is stop
+ 
+ 
+  The ignore option is applicable only for COPY FROM
+  when the FORMAT is text or csv.
+ 
+ 
+  An NOTICE level context message containing the ignored row count is
+  emitted at the end of the COPY FROM 

Re: Small fix on COPY ON_ERROR document

2024-01-28 Thread Yugo NAGATA
On Sun, 28 Jan 2024 19:14:58 -0700
"David G. Johnston"  wrote:

> > Also, I think "invalid input syntax" is a bit ambiguous. For example,
> > COPY FROM raises an error when the number of input column does not match
> > to the table schema, but this error is not ignored by ON_ERROR while
> > this seems to fall into the category of "invalid input syntax".
> 
> 
> 
> It is literally the error text that appears if one were not to ignore it.
> It isn’t a category of errors.  But I’m open to ideas here.  But being
> explicit with what on actually sees in the system seemed preferable to
> inventing new classification terms not otherwise used.

Thank you for explanation! I understood the words was from the error messages
that users actually see. However, as Torikoshi-san said in [1], errors other
than valid input syntax (e.g. range error) can be also ignored, therefore it
would be better to describe to be ignored errors more specifically.

[1] 
https://www.postgresql.org/message-id/7f1457497fa3bf9dfe486f162d1c8ec6%40oss.nttdata.com

> 
> >
> > So, keeping consistency with the existing description, we can say:
> >
> > "Specifies which how to behave when encountering an error due to
> >  column values unacceptable to the input function of each attribute's
> >  data type."
> 
> 
> Yeah, I was considering something along those lines as an option as well.
> But I’d rather add that wording to the glossary.

Although I am still be not convinced if we have to introduce the words
"soft error" to the documentation, I don't care it if there are no other
opposite opinions. 

> 
> > Currently, ON_ERROR doesn't support other soft errors, so it can explain
> > it more simply without introducing the new concept, "soft error" to users.
> >
> >
> Good point.  Seems we should define what user-facing errors are ignored
> anywhere in the system and if we aren’t consistently leveraging these in
> all areas/commands make the necessary qualifications in those specific
> places.
> 

> > I think "left in a deleted state" is also unclear for users because this
> > explains the internal state but not how looks from user's view.How about
> > leaving the explanation "These rows will not be visible or accessible" in
> > the existing statement?
> >
> 
> Just visible then, I don’t like an “or” there and as tuples at least they
> are accessible to the system, in vacuum especially.  But I expected the
> user to understand “as if you deleted it” as their operational concept more
> readily than visible.  I think this will be read by people who haven’t read
> MVCC to fully understand what visible means but know enough to run vacuum
> to clean up updated and deleted data as a rule.

Ok, I agree we can omit "or accessible". How do you like the followings?
Still redundant?

 "If the command fails, these rows are left in a deleted state;
  these rows will not be visible, but they still occupy disk space. "

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: Small fix on COPY ON_ERROR document

2024-01-28 Thread torikoshia

On 2024-01-27 00:04, David G. Johnston wrote:

On Fri, Jan 26, 2024 at 2:30 AM Yugo NAGATA 
wrote:


On Fri, 26 Jan 2024 00:00:57 -0700
"David G. Johnston"  wrote:


I will need to make this tweak and probably a couple others to my

own

suggestions in 12 hours or so.



And here is my v2.

Notably I choose to introduce the verbiage "soft error" and then
define in the ON_ERROR clause the specific soft error that matters
here - "invalid input syntax".

I also note the log message behavior when ignore mode is chosen.  I
haven't confirmed that it is accurate but that is readily tweaked if
approved of.

David J.


Thanks for refining the doc.



+  Specifies which how to behave when encountering a soft error.


To be consistent with other parts in the manual[1][2], should be “soft” 
error?


+  An error_action 
value of

+  stop means fail the command, while
+  ignore means discard the input row and 
continue with the next one.

+  The default is stop


Is "." required at the end of the line?

+ 
+  The only relevant soft error is "invalid input syntax", which 
manifests when attempting

+  to create a column value from the text input.
+ 

I think it is not restricted to "invalid input syntax".
We can handle out of range error:

  =# create table t1(i int);
  CREATE TABLE

  =# copy t1  from stdin with(ON_ERROR ignore);
  Enter data to be copied followed by a newline.
  End with a backslash and a period on a line by itself, or an EOF
  signal.
  >> 1
  >> \.
  NOTICE:  1 row was skipped due to data type incompatibility
  COPY 0


Also, I'm a little concerned that users might wonder what soft error is.

Certainly there are already references to "soft" errors in the manual, 
but they seem to be for developer, such as creating new TYPE for 
PostgreSQL.


It might be better to describe what soft error is like below:


-- src/backend/utils/fmgr/README
An error reported "softly" must be safe, in the sense that there is
no question about our ability to continue normal processing of the
transaction.


[1] https://www.postgresql.org/docs/devel/sql-createtype.html
[2] https://www.postgresql.org/docs/devel/functions-info.html

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Small fix on COPY ON_ERROR document

2024-01-28 Thread Yugo NAGATA
On Fri, 26 Jan 2024 08:04:45 -0700
"David G. Johnston"  wrote:

> On Fri, Jan 26, 2024 at 2:30 AM Yugo NAGATA  wrote:
> 
> > On Fri, 26 Jan 2024 00:00:57 -0700
> > "David G. Johnston"  wrote:
> >
> > > I will need to make this tweak and probably a couple others to my own
> > > suggestions in 12 hours or so.
> > >
> >
> >
> And here is my v2.
> 
> Notably I choose to introduce the verbiage "soft error" and then define in
> the ON_ERROR clause the specific soft error that matters here - "invalid
> input syntax".

I am not sure we should use "soft error" without any explanation
because it seems to me that the meaning of words is unclear for users. 

This is explained in src/backend/utils/fmgr/README;

 * Some callers of datatype input functions (and in future perhaps
 other classes of functions) pass an instance of ErrorSaveContext.
 This indicates that the caller wishes to handle "soft" errors without
 a transaction-terminating exception being thrown: instead, the callee
 should store information about the error cause in the ErrorSaveContext
 struct and return a dummy result value. 

However, this is not mentioned in the documentation anywhere. So, it
would be hard for users to understand what "soft error" is because
they would not read README.

Also, I think "invalid input syntax" is a bit ambiguous. For example, 
COPY FROM raises an error when the number of input column does not match
to the table schema, but this error is not ignored by ON_ERROR while
this seems to fall into the category of "invalid input syntax".

I found the description as followings in the documentation in COPY:

 The column values themselves are strings (...) acceptable to the input
 function, of each attribute's data type.

So, keeping consistency with the existing description, we can say:

"Specifies which how to behave when encountering an error due to
 column values unacceptable to the input function of each attribute's
 data type."

Currently, ON_ERROR doesn't support other soft errors, so it can explain
it more simply without introducing the new concept, "soft error" to users.

> I also note the log message behavior when ignore mode is chosen.  I haven't
> confirmed that it is accurate but that is readily tweaked if approved of.
> 

+  An INFO level context message containing the ignored 
row count is
+  emitted at the end of the COPY FROM if at least one 
row was discarded.


The log level is NOTICE not INFO.


-COPY stops operation at the first error when
-ON_ERROR is not specified. This
-should not lead to problems in the event of a COPY
-TO, but the target table will already have received
-earlier rows in a COPY FROM. These rows will not
-be visible or accessible, but they still occupy disk space. This might
+The COPY FROM command physically inserts input rows
+into the table as it progresses.  If the command fails these rows are
+left in a deleted state, still occupying disk space. This might
 amount to a considerable amount of wasted disk space if the failure
-happened well into a large copy operation. You might wish to invoke
-VACUUM to recover the wasted space.
+happened well into a large copy operation. VACUUM
+should be used to recover the wasted space.


I think "left in a deleted state" is also unclear for users because this
explains the internal state but not how looks from user's view.How about
leaving the explanation "These rows will not be visible or accessible" in
the existing statement?

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: Small fix on COPY ON_ERROR document

2024-01-26 Thread David G. Johnston
On Fri, Jan 26, 2024 at 2:30 AM Yugo NAGATA  wrote:

> On Fri, 26 Jan 2024 00:00:57 -0700
> "David G. Johnston"  wrote:
>
> > I will need to make this tweak and probably a couple others to my own
> > suggestions in 12 hours or so.
> >
>
>
And here is my v2.

Notably I choose to introduce the verbiage "soft error" and then define in
the ON_ERROR clause the specific soft error that matters here - "invalid
input syntax".

I also note the log message behavior when ignore mode is chosen.  I haven't
confirmed that it is accurate but that is readily tweaked if approved of.

David J.
From 2d656fe0a69ea1349472d88794c18f8a2e2d37e9 Mon Sep 17 00:00:00 2001
From: "David G. Johnston" 
Date: Thu, 25 Jan 2024 23:35:44 -0700
Subject: [PATCH] v2 Improve ON_ERROR verbiage in COPY documentation

---
 doc/src/sgml/ref/copy.sgml | 48 +++---
 1 file changed, 29 insertions(+), 19 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 21a5c4a052..38ce0b0a4c 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -90,6 +90,13 @@ COPY { table_name [ ( pg_stat_progress_copy view. See
  for details.
   
+
+  
+By default, COPY will fail if it encounters an error
+during processing. For use cases where a best-effort attempt at loading
+the entire file is desired, the ON_ERROR clause can
+be used to specify some other behavior.
+  
  
 
  
@@ -378,17 +385,23 @@ COPY { table_name [ ( ON_ERROR
 
  
-  Specifies which 
-  error_action to perform when there is malformed data in the input.
-  Currently, only stop (default) and ignore
-  values are supported.
-  If the stop value is specified,
-  COPY stops operation at the first error.
-  If the ignore value is specified,
-  COPY skips malformed data and continues copying data.
-  The option is allowed only in COPY FROM.
-  Only stop value is allowed when
-  using binary format.
+  Specifies which how to behave when encountering a soft error.
+  An error_action value of
+  stop means fail the command, while
+  ignore means discard the input row and continue with the next one.
+  The default is stop
+ 
+ 
+  The ignore option is applicable only for COPY FROM
+  when the FORMAT is text or csv.
+ 
+ 
+  The only relevant soft error is "invalid input syntax", which manifests when attempting
+  to create a column value from the text input.
+ 
+ 
+  An INFO level context message containing the ignored row count is
+  emitted at the end of the COPY FROM if at least one row was discarded.
  
 

@@ -576,15 +589,12 @@ COPY count

 

-COPY stops operation at the first error when
-ON_ERROR is not specified. This
-should not lead to problems in the event of a COPY
-TO, but the target table will already have received
-earlier rows in a COPY FROM. These rows will not
-be visible or accessible, but they still occupy disk space. This might
+The COPY FROM command physically inserts input rows
+into the table as it progresses.  If the command fails these rows are
+left in a deleted state, still occupying disk space. This might
 amount to a considerable amount of wasted disk space if the failure
-happened well into a large copy operation. You might wish to invoke
-VACUUM to recover the wasted space.
+happened well into a large copy operation. VACUUM
+should be used to recover the wasted space.

 

-- 
2.34.1



Re: Small fix on COPY ON_ERROR document

2024-01-26 Thread Yugo NAGATA
On Fri, 26 Jan 2024 00:00:57 -0700
"David G. Johnston"  wrote:

> On Thursday, January 25, 2024, Yugo NAGATA  wrote:
> 
> >
> > Maybe, we can separate the sentese to two, for example:
> >
> >   COPY stops operation at the first error. (The exception is if the error
> >   is due to data type incompatibility and a value other than stop is
> > specified
> >   to the ON_ERROR option.)
> >
> 
> I’d lean more toward saying:
> 
> Copy from can be instructed to ignore errors that arise from casting input
> data to the data types of the target columns by setting the on_error option
> to ignore.  Skipping the entire input row in the process.
> 
> The last part is because another way to ignore would be to set null values
> for those columns.

That makes sense. Is is a bit ambiguous to just say "skips malformed data";
it might be unclear for users if the data in the problematic column is skipped
(NULL is set) or the entire row is skipped. Also, setting null values for those
columns could be a future feature of ON_ERROR option.
> 
> That a command stops on an error is the assumed behavior throughout the
> system, writing “copy stops operation at the first error.” just seems
> really unnecessary.

I think we need to describe this default behavior explicitly somewhere,
as you suggested in the other post [1].

[1] 
https://www.postgresql.org/message-id/CAKFQuwZJZ6uaS2B7qpL2FJzWBsnDdzgtbsW3pH9zuT6vC3fH%2Bg%40mail.gmail.com

Regards,
Yugo Nagata

> I will need to make this tweak and probably a couple others to my own
> suggestions in 12 hours or so.
> 
> David J.


-- 
Yugo NAGATA 




Re: Small fix on COPY ON_ERROR document

2024-01-26 Thread Yugo NAGATA
On Thu, 25 Jan 2024 23:33:22 -0700
"David G. Johnston"  wrote:

> On Thu, Jan 25, 2024 at 10:40 PM Yugo NAGATA  wrote:
> 
> > On Fri, 26 Jan 2024 13:59:09 +0900
> > Masahiko Sawada  wrote:
> >
> > > On Fri, Jan 26, 2024 at 11:28 AM Yugo NAGATA 
> > wrote:
> > > >
> > > > Hi,
> > > >
> > > > I found that the documentation of COPY ON_ERROR said
> > > > COPY stops operation at the first error when
> > > > "ON_ERROR is not specified.", but it also stop when
> > > > ON_ERROR is specified to the default value, "stop".
> > > >
> >
> >
> I'm finding the current wording surrounding ON_ERROR to not fit in with the
> rest of the page.  I've tried to improve things by being a bit more
> invasive.

Introducing the ON_ERROR option changed the behavior of COPY about error
handling to some extent, so it might be good to rewrite a bit more parts
of the documentation as you suggest.

> This first hunk updates the description of the COPY command to include
> describing the purpose of the ON_ERROR clause.
> 
> I too am concerned about the word "parsed" here, and "malformed" in the
> more detailed descriptions; this would need to be modified to better
> reflect the circumstances under which ignore happens.

I think "errors due to data type incompatibility" would be nice to describe
the errors to be ignored by ON_ERROR, as used in the NOTICE message output.
 
> diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
> index 21a5c4a052..5fe4c9f747 100644
> --- a/doc/src/sgml/ref/copy.sgml
> +++ b/doc/src/sgml/ref/copy.sgml
> @@ -90,6 +90,14 @@ COPY {  class="parameter">table_name [ (   in the pg_stat_progress_copy view. See
>   for details.
>
> +
> +  
> +By default, COPY will abort if it encounters errors.
> +For non-binary format COPY FROM the user can specify
> +to instead ignore input rows that cannot be parsed by specifying
> +the ignore option to the ON_ERROR
> +clause.
> +  
>   
> 

> 
> The following was, IMO, too much text about something not to worry about,
> COPY TO and ignore mode.  The point is dead tuples on error and running
> vacuum.  It doesn't really even matter what caused the error, though if the
> error was even before rows started to be imported then obviously there
> would be no dead tuples.
> 
> Oh, and the word "first" really isn't adding anything here that we cannot
> reasonably leave to common sense, IMO.  We either ignore all errors or stop
> on the first one.  There isn't a stop mode that is capable of bypassing
> errors and the ignore mode doesn't have a "first n" option so one assumes
> all errors are ignored.
> 
> @@ -576,15 +583,12 @@ COPY  class="parameter">count
> 
> 
> 
> -COPY stops operation at the first error when
> -ON_ERROR is not specified. This
> -should not lead to problems in the event of a COPY
> -TO, but the target table will already have received
> -earlier rows in a COPY FROM. These rows will not
> -be visible or accessible, but they still occupy disk space. This might
> -amount to a considerable amount of wasted disk space if the failure
> -happened well into a large copy operation. You might wish to invoke
> -VACUUM to recover the wasted space.
> +A failed COPY FROM command will have performed
> +physical insertion of all rows prior to the malformed one.

As you said that it does not matter what caused the error, I don't think
we have to use "malformed" here. Instead, we could say "we will have
performed physical insertion of rows before the failure."


> +While these rows will not be visible or accessible, they still occupy
> +disk space. This might amount to a considerable amount of wasted disk
> +space if the failure happened well into a large copy operation.
> +VACUUM should be used to recover the wasted space.
> 
> 
> 
> 
> David J.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: Small fix on COPY ON_ERROR document

2024-01-25 Thread David G. Johnston
On Thursday, January 25, 2024, Yugo NAGATA  wrote:

>
> Maybe, we can separate the sentese to two, for example:
>
>   COPY stops operation at the first error. (The exception is if the error
>   is due to data type incompatibility and a value other than stop is
> specified
>   to the ON_ERROR option.)
>

I’d lean more toward saying:

Copy from can be instructed to ignore errors that arise from casting input
data to the data types of the target columns by setting the on_error option
to ignore.  Skipping the entire input row in the process.

The last part is because another way to ignore would be to set null values
for those columns.

That a command stops on an error is the assumed behavior throughout the
system, writing “copy stops operation at the first error.” just seems
really unnecessary.

I will need to make this tweak and probably a couple others to my own
suggestions in 12 hours or so.

David J.


Re: Small fix on COPY ON_ERROR document

2024-01-25 Thread Yugo NAGATA
On Fri, 26 Jan 2024 15:26:55 +0900
Masahiko Sawada  wrote:

> On Fri, Jan 26, 2024 at 2:40 PM Yugo NAGATA  wrote:
> >
> > On Fri, 26 Jan 2024 13:59:09 +0900
> > Masahiko Sawada  wrote:
> >
> > > On Fri, Jan 26, 2024 at 11:28 AM Yugo NAGATA  wrote:
> > > >
> > > > Hi,
> > > >
> > > > I found that the documentation of COPY ON_ERROR said
> > > > COPY stops operation at the first error when
> > > > "ON_ERROR is not specified.", but it also stop when
> > > > ON_ERROR is specified to the default value, "stop".
> > > >
> > > > I attached a very small patch to fix this just for
> > > > making the description more accurate.
> > >
> > > Thank you for the patch!
> > >
> > > +1 to fix it.
> > >
> > > -ON_ERROR is not specified. This
> > > -should not lead to problems in the event of a COPY
> > > +ON_ERROR is not specified or 
> > > stop.
> > > +This should not lead to problems in the event of a COPY
> > >
> > > How about the followings for consistency with the description of the
> > > ON_ERROR option?
> > >
> > > COPY stops operation at the first error if the stop value is specified
> > > to the ON_ERROR option. This should not lead to ...
> >
> > Thank you for you review. However, after posting the previous patch,
> > I noticed that I overlooked ON_ERROR works only for errors due to data
> > type incompatibility (that is mentioned as "malformed data" in the
> > ON_ERROR description, though).
> 
> Right.
> 
> >
> > So, how about rewriting this like:
> >
> >  COPY stops operation at the first error unless the error is due to data
> >  type incompatibility and a value other than stop is specified to the
> >  ON_ERROR option.
> 
> Hmm, this sentence seems not very readable to me, especially the "COPY
> stops ... unless ... a value other than stop is specified ..." part. I
> think we can simplify it:

I can agree with your opinion on the readability, but...

> COPY stops operation at the first data type incompatibility error if
> the stop value is specified to the ON_ERROR option.

This statement doesn't explain COPY also stops when an error other than
data type incompatibility (e.g. constrain violations) occurs.

Maybe, we can separate the sentese to two, for example:

  COPY stops operation at the first error. (The exception is if the error
  is due to data type incompatibility and a value other than stop is specified
  to the ON_ERROR option.)

Regards,
Yugo Nagata

> Regards,
> 
> -- 
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com


-- 
Yugo NAGATA 




Re: Small fix on COPY ON_ERROR document

2024-01-25 Thread David G. Johnston
On Thu, Jan 25, 2024 at 10:40 PM Yugo NAGATA  wrote:

> On Fri, 26 Jan 2024 13:59:09 +0900
> Masahiko Sawada  wrote:
>
> > On Fri, Jan 26, 2024 at 11:28 AM Yugo NAGATA 
> wrote:
> > >
> > > Hi,
> > >
> > > I found that the documentation of COPY ON_ERROR said
> > > COPY stops operation at the first error when
> > > "ON_ERROR is not specified.", but it also stop when
> > > ON_ERROR is specified to the default value, "stop".
> > >
>
>
I'm finding the current wording surrounding ON_ERROR to not fit in with the
rest of the page.  I've tried to improve things by being a bit more
invasive.

This first hunk updates the description of the COPY command to include
describing the purpose of the ON_ERROR clause.

I too am concerned about the word "parsed" here, and "malformed" in the
more detailed descriptions; this would need to be modified to better
reflect the circumstances under which ignore happens.

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 21a5c4a052..5fe4c9f747 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -90,6 +90,14 @@ COPY { table_name [ ( pg_stat_progress_copy view. See
  for details.
   
+
+  
+By default, COPY will abort if it encounters errors.
+For non-binary format COPY FROM the user can specify
+to instead ignore input rows that cannot be parsed by specifying
+the ignore option to the ON_ERROR
+clause.
+  
  

  

The use of the word "Currently" stood out to me when reading the next
hunk.  We always document current reality and don't call out that fact.  We
do not imply some future feature may change how this all works.  On a
related note, right now we have stop and ignore, which are basically enums
just like we have for format (csv, text, binary).  Unlike format, though,
this option requires single quotes.  Why is that?  I did keep with not
specifying the enum in the main syntax block since format doesn't as well;
though writing { stop | ignore } seemed quite appealing.

The rest of the page indicates what the default is in a sentence by itself,
not as a parenthetical next to the option.

The command limitations seemed worthy of a separate paragraph, though it
repeats the description which isn't great.  I'm going to sleep on this one.

@@ -379,17 +387,16 @@ COPY { table_name [ ( 
  
   Specifies which 
-  error_action to perform when there is malformed data
in the input.
-  Currently, only stop (default) and
ignore
-  values are supported.
-  If the stop value is specified,
-  COPY stops operation at the first error.
-  If the ignore value is specified,
-  COPY skips malformed data and continues copying
data.
-  The option is allowed only in COPY FROM.
-  Only stop value is allowed when
-  using binary format.
+  error_action to perform when encountering a malformed
input row:
+  stop or ignore.
+  A value of stop means fail the command, while
+  ignore means discard the input row and continue
with the next one.
+  The default is stop
  
+ 
+  The ignore option is applicable only for COPY FROM
+  when the FORMAT is text
+  or csv.
+
 


The following was, IMO, too much text about something not to worry about,
COPY TO and ignore mode.  The point is dead tuples on error and running
vacuum.  It doesn't really even matter what caused the error, though if the
error was even before rows started to be imported then obviously there
would be no dead tuples.

Oh, and the word "first" really isn't adding anything here that we cannot
reasonably leave to common sense, IMO.  We either ignore all errors or stop
on the first one.  There isn't a stop mode that is capable of bypassing
errors and the ignore mode doesn't have a "first n" option so one assumes
all errors are ignored.

@@ -576,15 +583,12 @@ COPY count



-COPY stops operation at the first error when
-ON_ERROR is not specified. This
-should not lead to problems in the event of a COPY
-TO, but the target table will already have received
-earlier rows in a COPY FROM. These rows will not
-be visible or accessible, but they still occupy disk space. This might
-amount to a considerable amount of wasted disk space if the failure
-happened well into a large copy operation. You might wish to invoke
-VACUUM to recover the wasted space.
+A failed COPY FROM command will have performed
+physical insertion of all rows prior to the malformed one.
+While these rows will not be visible or accessible, they still occupy
+disk space. This might amount to a considerable amount of wasted disk
+space if the failure happened well into a large copy operation.
+VACUUM should be used to recover the wasted space.




David J.


Re: Small fix on COPY ON_ERROR document

2024-01-25 Thread Masahiko Sawada
On Fri, Jan 26, 2024 at 2:40 PM Yugo NAGATA  wrote:
>
> On Fri, 26 Jan 2024 13:59:09 +0900
> Masahiko Sawada  wrote:
>
> > On Fri, Jan 26, 2024 at 11:28 AM Yugo NAGATA  wrote:
> > >
> > > Hi,
> > >
> > > I found that the documentation of COPY ON_ERROR said
> > > COPY stops operation at the first error when
> > > "ON_ERROR is not specified.", but it also stop when
> > > ON_ERROR is specified to the default value, "stop".
> > >
> > > I attached a very small patch to fix this just for
> > > making the description more accurate.
> >
> > Thank you for the patch!
> >
> > +1 to fix it.
> >
> > -ON_ERROR is not specified. This
> > -should not lead to problems in the event of a COPY
> > +ON_ERROR is not specified or 
> > stop.
> > +This should not lead to problems in the event of a COPY
> >
> > How about the followings for consistency with the description of the
> > ON_ERROR option?
> >
> > COPY stops operation at the first error if the stop value is specified
> > to the ON_ERROR option. This should not lead to ...
>
> Thank you for you review. However, after posting the previous patch,
> I noticed that I overlooked ON_ERROR works only for errors due to data
> type incompatibility (that is mentioned as "malformed data" in the
> ON_ERROR description, though).

Right.

>
> So, how about rewriting this like:
>
>  COPY stops operation at the first error unless the error is due to data
>  type incompatibility and a value other than stop is specified to the
>  ON_ERROR option.

Hmm, this sentence seems not very readable to me, especially the "COPY
stops ... unless ... a value other than stop is specified ..." part. I
think we can simplify it:

COPY stops operation at the first data type incompatibility error if
the stop value is specified to the ON_ERROR option.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Small fix on COPY ON_ERROR document

2024-01-25 Thread Yugo NAGATA
On Fri, 26 Jan 2024 13:59:09 +0900
Masahiko Sawada  wrote:

> On Fri, Jan 26, 2024 at 11:28 AM Yugo NAGATA  wrote:
> >
> > Hi,
> >
> > I found that the documentation of COPY ON_ERROR said
> > COPY stops operation at the first error when
> > "ON_ERROR is not specified.", but it also stop when
> > ON_ERROR is specified to the default value, "stop".
> >
> > I attached a very small patch to fix this just for
> > making the description more accurate.
> 
> Thank you for the patch!
> 
> +1 to fix it.
> 
> -ON_ERROR is not specified. This
> -should not lead to problems in the event of a COPY
> +ON_ERROR is not specified or stop.
> +This should not lead to problems in the event of a COPY
> 
> How about the followings for consistency with the description of the
> ON_ERROR option?
> 
> COPY stops operation at the first error if the stop value is specified
> to the ON_ERROR option. This should not lead to ...

Thank you for you review. However, after posting the previous patch, 
I noticed that I overlooked ON_ERROR works only for errors due to data
type incompatibility (that is mentioned as "malformed data" in the 
ON_ERROR description, though). 

So, how about rewriting this like:

 COPY stops operation at the first error unless the error is due to data
 type incompatibility and a value other than stop is specified to the
 ON_ERROR option.

I attached the updated patch.

Regards,
Yugo Nagata

> 
> Regards,
> 
> -- 
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com


-- 
Yugo NAGATA 
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 21a5c4a052..3676bf0e87 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -576,9 +576,10 @@ COPY count

 

-COPY stops operation at the first error when
-ON_ERROR is not specified. This
-should not lead to problems in the event of a COPY
+COPY stops operation at the first error unless the error
+is due to data type incompatibility and a value other than
+stop is specified to the ON_ERROR option.
+This should not lead to problems in the event of a COPY
 TO, but the target table will already have received
 earlier rows in a COPY FROM. These rows will not
 be visible or accessible, but they still occupy disk space. This might


Re: Small fix on COPY ON_ERROR document

2024-01-25 Thread Masahiko Sawada
On Fri, Jan 26, 2024 at 11:28 AM Yugo NAGATA  wrote:
>
> Hi,
>
> I found that the documentation of COPY ON_ERROR said
> COPY stops operation at the first error when
> "ON_ERROR is not specified.", but it also stop when
> ON_ERROR is specified to the default value, "stop".
>
> I attached a very small patch to fix this just for
> making the description more accurate.

Thank you for the patch!

+1 to fix it.

-ON_ERROR is not specified. This
-should not lead to problems in the event of a COPY
+ON_ERROR is not specified or stop.
+This should not lead to problems in the event of a COPY

How about the followings for consistency with the description of the
ON_ERROR option?

COPY stops operation at the first error if the stop value is specified
to the ON_ERROR option. This should not lead to ...

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com