Re: Fixing backslash dot for COPY FROM...CSV

2024-04-06 Thread Tom Lane
Bruce Momjian  writes:
> On Sun, Apr  7, 2024 at 12:00:25AM +0200, Daniel Verite wrote:
>> Agreed we don't want to document that, but also why doesn't \. in the
>> contents represent just a dot  (as opposed to being an error),
>> just like \a is a?

> I looked into this and started to realize that \. is the only copy
> backslash command where we define the behavior only alone at the
> beginning of a line, and not in other circumstances.  The \a example
> above suggests \. should be period in all other cases, as suggested, but
> I don't know the ramifications if that.

Here's the problem: if some client-side code thinks it's okay to
quote "." as "\.", what is likely to happen when it's presented
a data value consisting only of "."?  It could very easily fall
into the trap of producing an end-of-data marker.

If we get rid of the whole concept of end-of-data markers, then
it'd be totally reasonable to accept "\." as ".".  But as long
as we still have end-of-data markers, I think it's unwise to allow
"\." to appear as anything but an end-of-data marker.  It'd just
add camouflage to the booby trap.

regards, tom lane




Re: Fixing backslash dot for COPY FROM...CSV

2024-04-06 Thread Bruce Momjian
On Sun, Apr  7, 2024 at 12:00:25AM +0200, Daniel Verite wrote:
>   Tom Lane wrote:
> 
> > This is sufficiently weird that I'm starting to come around to
> > Daniel's original proposal that we just drop the server's recognition
> > of \. altogether (which would allow removal of some dozens of lines of
> > complicated and now known-buggy code)
> 
> FWIW my plan was to not change anything in the TEXT mode,
> but I wasn't aware it had this issue that you found when
> \. is not in a line by itself.
> 
> >  Alternatively, we could fix it so that \. at the end of a line draws
> > "end-of-copy marker corrupt"
> > which would at least make things consistent, but I'm not sure that has
> > any great advantage.  I surely don't want to document the current
> > behavioral details as being the right thing that we're going to keep
> > doing.
> 
> Agreed we don't want to document that, but also why doesn't \. in the
> contents represent just a dot  (as opposed to being an error),
> just like \a is a?

I looked into this and started to realize that \. is the only copy
backslash command where we define the behavior only alone at the
beginning of a line, and not in other circumstances.  The \a example
above suggests \. should be period in all other cases, as suggested, but
I don't know the ramifications if that.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Fixing backslash dot for COPY FROM...CSV

2024-04-06 Thread Daniel Verite
Tom Lane wrote:

> This is sufficiently weird that I'm starting to come around to
> Daniel's original proposal that we just drop the server's recognition
> of \. altogether (which would allow removal of some dozens of lines of
> complicated and now known-buggy code)

FWIW my plan was to not change anything in the TEXT mode,
but I wasn't aware it had this issue that you found when
\. is not in a line by itself.

>  Alternatively, we could fix it so that \. at the end of a line draws
> "end-of-copy marker corrupt"
> which would at least make things consistent, but I'm not sure that has
> any great advantage.  I surely don't want to document the current
> behavioral details as being the right thing that we're going to keep
> doing.

Agreed we don't want to document that, but also why doesn't \. in the
contents represent just a dot  (as opposed to being an error),
just like \a is a?

I mean if eofdata contains

  foobar\a
  foobaz\aother

then we get after import:
  f1  
--
 foobara
 foobazaother
(2 rows)

Reading the current doc on the text format, I can't see why
importing:

  foobar\.
  foobar\.other

is not supposed to produce
  f1  
--
 foobar.
 foobaz.other
(2 rows)


I see these rules in [1] about backslash:

#1. 
  "End of data can be represented by a single line containing just
   backslash-period (\.)."

foobar\. and foobar\.other do not match that so #1 does not describe
how they're interpreted.

#2.
  "Backslash characters (\) can be used in the COPY data to quote data
  characters that might otherwise be taken as row or column
  delimiters."

Dot is not a column delimiter (it's forbidden anyway), so #2 does
not apply.

#3.
  "In particular, the following characters must be preceded by a
  backslash if they appear as part of a column value: backslash itself,
  newline, carriage return, and the current delimiter character"

Dot is not in that list so #3 does not apply.

#4.
  "The following special backslash sequences are recognized by COPY
  FROM:" (followed by the table with \b \f, ...,)

Dot is not mentioned.

#5.
  "Any other backslashed character that is not mentioned in the above
  table will be taken to represent itself"

Here we say that backslash dot represents a dot (unless other
rules apply)

  foobar\. => foobar. 
  foobar\.other => foobar.other

#6.
  "However, beware of adding backslashes unnecessarily, since that
   might accidentally produce a string matching the end-of-data marker
   (\.) or the null string (\N by default)."

So we *recommend* not to use \. but as I understand it, the warning
with the EOD marker is about accidentally creating a line that matches #1,
that is, \. alone on a line.

#7
  "These strings will be recognized before any other backslash
  processing is done."

TBH I don't understand what #7 implies. The order in backslash
processing looks like an implementation detail that should not
matter in understanding the format?


Considering this, it seems to me that #5 says that
backslash-dot represents a dot unless #1 applies, and the
other #2 #3 #4 #6 #7 rules do not state anything that would
contradict that.


[1] https://www.postgresql.org/docs/current/sql-copy.html


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Fixing backslash dot for COPY FROM...CSV

2024-04-06 Thread Tom Lane
I wrote:
> So the current behavior is that \. that is on the end of a line,
> but is not the whole line, is silently discarded and we keep going.

> All versions throw "end-of-copy marker corrupt" if there is
> something after \. on the same line.

> This is sufficiently weird that I'm starting to come around to
> Daniel's original proposal that we just drop the server's recognition
> of \. altogether (which would allow removal of some dozens of lines of
> complicated and now known-buggy code).

I experimented with that and soon ran into a nasty roadblock: it
breaks dump/restore, because pg_dump includes a "\." line after
COPY data whether or not it really needs one.  Worse, that's
implemented by including the "\." line into the archive format,
so that existing dump files contain it.  Getting rid of it would
require an archive format version bump, plus some hackery to allow
removal of the line when reading old dump files.

While that's surely doable with enough effort, it's not the kind
of thing to be undertaking with less than 2 days to feature freeze.
Not to mention that I'm not sure we have consensus to do it at all.

More fun stuff: PQgetline actually invents a "\." line when it
sees server end-of-copy, and we tell users of that function to
check for that not an out-of-band return value to detect EOF.
It looks like we have no callers of that in the core distro,
but do we want to deprecate it completely?

So I feel like we need to put this patch on the shelf for the moment
and come back to it early in v18.  Although it seems reasonably clear
what to do on the CSV side of things, it's very much less clear what
to do about text-format handling of EOD markers, and I don't want to
change one of those things in v17 and the other in v18.  Also it
seems like there are more dependencies on "\." than we realized.

There could be an argument for applying just the psql change now,
to remove its unnecessary sending of "\.".  That won't break
anything and it would give us at least one year's leg up on
compatibility issues.

regards, tom lane




Re: Fixing backslash dot for COPY FROM...CSV

2024-04-05 Thread Tom Lane
After some more poking at this topic, I realize that there is already
very strange and undocumented behavior for backslash-dot even in
non-CSV mode.  Create a file like this:

$ cat eofdata
foobar
foobaz\.
more
\.
yet more

and try importing it with COPY:

regression=# create table eofdata(f1 text);
CREATE TABLE
regression=# copy eofdata from '/home/tgl/pgsql/eofdata';
COPY 2
regression=# table eofdata;
   f1   

 foobar
 foobaz
(2 rows)

That's what you get in 9.0 and earlier versions, and it's already
not-as-documented, because we claim that only \. alone on a line is an
EOF marker; we certainly don't suggest that what's in front of it will
be taken as valid data.  However, somebody broke it some more in 9.1,
because 9.1 up to HEAD produce this result:

regression=# create table eofdata(f1 text);
CREATE TABLE
regression=# copy eofdata from '/home/tgl/pgsql/eofdata';
COPY 3
regression=# table eofdata;
   f1   

 foobar
 foobaz
 more
(3 rows)

So the current behavior is that \. that is on the end of a line,
but is not the whole line, is silently discarded and we keep going.

All versions throw "end-of-copy marker corrupt" if there is
something after \. on the same line.

This is sufficiently weird that I'm starting to come around to
Daniel's original proposal that we just drop the server's recognition
of \. altogether (which would allow removal of some dozens of lines of
complicated and now known-buggy code).  Alternatively, we could fix it
so that \. at the end of a line draws "end-of-copy marker corrupt",
which would at least make things consistent, but I'm not sure that has
any great advantage.  I surely don't want to document the current
behavioral details as being the right thing that we're going to keep
doing.

Thoughts?

regards, tom lane




Re: Fixing backslash dot for COPY FROM...CSV

2024-04-05 Thread Tom Lane
"Daniel Verite"  writes:
>   Tom Lane wrote:
>> Not sure what to do here.  One idea is to install just the psql-side
>> fix, which should break nothing now that version-2 protocol is dead,
>> and then wait a few years before introducing the server-side change.
>> That seems kind of sad though.

> Wouldn't backpatching solve this?

No, it'd just reduce the surface area a bit.  People on less-than-
the-latest-minor-release would still have the issue.  In any case
back-patching further than v14 would be a nonstarter, because we
didn't remove protocol v2 support till then.

However, the analogy to "\d commands might fail against a newer
server" reduces my level of concern quite a lot: it's hard to
draw much of a line separating that kind of issue from "inline
COPY CSV will fail against a newer server".  It's not like such
failures won't be obvious and fairly easy to diagnose.

regards, tom lane




Re: Fixing backslash dot for COPY FROM...CSV

2024-04-05 Thread Daniel Verite
Tom Lane wrote:

> Not sure what to do here.  One idea is to install just the psql-side
> fix, which should break nothing now that version-2 protocol is dead,
> and then wait a few years before introducing the server-side change.
> That seems kind of sad though.

Wouldn't backpatching solve this?  Then only the users who don't
apply the minor updates would have non-matching server and psql.
Initially I though that obsoleting the v2 protocol was a recent
move, but reading older messages from the list I've got the
impression that it was more or less in the pipeline since way
before version 10.

Also one of the cases the patch fixes, the one when imported
data are silently truncated at the point of \., is quite nasty
IMO.
I can imagine an app where user-supplied data would be
appended row-by-row into a CSV file, and would be processed
periodically by batch. Under some conditions, in particular
if newlines in the first column are allowed, a malevolent user could
submit a \. sequence  to cause the batch to miss the rest of the data
without any error being raised.


[1]
https://www.postgresql.org/message-id/11648.1403147417%40sss.pgh.pa.us

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Fixing backslash dot for COPY FROM...CSV

2024-04-05 Thread Tom Lane
I wrote:
> So this means that the patch introduces a rather serious cross-version
> compatibility problem.  I doubt we can consider inlined CSV data to be
> a niche case that few people use; but it will fail every time if your
> psql is older than your server.

On third thought, that may not be as bad as I was thinking.
We don't blink at the idea that an older psql's \d commands may
malfunction with a newer server, and I think most users have
internalized the idea that they want psql >= server.  If the
patch created an incompatibility with that direction, it'd be
a problem, but I don't think it does.

regards, tom lane




Re: Fixing backslash dot for COPY FROM...CSV

2024-04-05 Thread Tom Lane
"Daniel Verite"  writes:
>   Tom Lane wrote:
>> I've looked over this patch and I generally agree that this is a
>> reasonable solution.

> Thanks for reviewing this!

While testing this, I tried running the tests with an updated server
and non-updated psql, and not only did the new test case fail, but
so did a bunch of existing ones.  That's because existing psql will
send the trailing "\." of inlined data to the server, and the updated
server will now think that's data if it's in CSV mode.

So this means that the patch introduces a rather serious cross-version
compatibility problem.  I doubt we can consider inlined CSV data to be
a niche case that few people use; but it will fail every time if your
psql is older than your server.

Not sure what to do here.  One idea is to install just the psql-side
fix, which should break nothing now that version-2 protocol is dead,
and then wait a few years before introducing the server-side change.
That seems kind of sad though.

An argument for not waiting is that psql may not be the only client
that needs this behavioral adjustment, and if so there's going to
be breakage anyway when we change the server; so we might as well
get it over with.

regards, tom lane




Re: Fixing backslash dot for COPY FROM...CSV

2024-04-05 Thread Daniel Verite
Tom Lane wrote:

> I've looked over this patch and I generally agree that this is a
> reasonable solution.

Thanks for reviewing this!

> I'm also wondering why the patch adds a test for
> "PQprotocolVersion(conn) >= 3" in handleCopyIn. 

I've removed this in the attached update.

> I concur with Robert's doubts about some of the doc changes though.
> In particular, since we're not changing the behavior for non-CSV
> mode, we shouldn't remove any statements that apply to non-CSV mode.

I don't quite understand the issues with the doc changes. Let me
detail the changes.

The first change is under 
  
   CSV Format
so it does no concern non-CSV modes.

--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -761,11 +761,7 @@ COPY count
format, \., the end-of-data marker, could also appear
as a data value.  To avoid any misinterpretation, a \.
data value appearing as a lone entry on a line is automatically
-quoted on output, and on input, if quoted, is not interpreted as the
-end-of-data marker.  If you are loading a file created by another
-application that has a single unquoted column and might have a
-value of \., you might need to quote that value in
the
-input file.
+quoted on output.
   


The part about quoting output is kept because the code still does that.

About this bit:  
 "and on input, if quoted, is not interpreted as the end-of-data marker."

So the patch removes that part. The reason is \. is now not interpreted as
EOD in both cases, quoted or unquoted, conforming to spec.
Previously, what the reader should have understood by "if quoted, ..."
is that it implies "if not quoted, then .\ will be interpreted as EOD
even though that behavior does not conform to the CSV spec".
If we documented the new behavior, it would be something like:
when quoted, it works as expected, and when unquoted, it works as
expected too. Isn't it simpler not to say anything?

About the next sentence:
  "If you are loading a file created by another application
  that has a single unquoted column and might have a value of \., you
  might need to quote that value in the input file."

It's removed as well because it's not longer necessary
to do that.

The other hunk is in psql doc:

--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value'
'second value' \g
 destination, because all data must pass through the client/server
 connection.  For large amounts of data the SQL
 command might be preferable.
-Also, because of this pass-through method, \copy
-... from in CSV mode will erroneously
-treat a \. data value alone on a line as an
-end-of-input marker.

That behavior no longer happens, so this gets removed as well.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From 5a16b81f10e47fe1ac3842abd34e8bef7e639e26 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Fri, 5 Apr 2024 14:22:49 +0200
Subject: [PATCH v4] Support backslash-dot on a line by itself as valid data in
 COPY FROM...CSV

---
 doc/src/sgml/ref/copy.sgml   |  6 +--
 doc/src/sgml/ref/psql-ref.sgml   |  4 --
 src/backend/commands/copyfromparse.c | 57 +++-
 src/bin/psql/copy.c  | 30 ++-
 src/test/regress/expected/copy.out   | 18 +
 src/test/regress/sql/copy.sql| 12 ++
 6 files changed, 65 insertions(+), 62 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 33ce7c4ea6..a63f073c1b 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -814,11 +814,7 @@ COPY count
 format, \., the end-of-data marker, could also appear
 as a data value.  To avoid any misinterpretation, a \.
 data value appearing as a lone entry on a line is automatically
-quoted on output, and on input, if quoted, is not interpreted as the
-end-of-data marker.  If you are loading a file created by another
-application that has a single unquoted column and might have a
-value of \., you might need to quote that value in the
-input file.
+quoted on output.

 

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index cc7d797159..d94e3cacfc 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 
'second value' \g
 destination, because all data must pass through the client/server
 connection.  For large amounts of data the SQL
 command might be preferable.
-Also, because of this pass-through method, \copy
-... from in CSV mode will erroneously
-treat a \. data value alone on a line as an
-end-of-input marker.
 
 
 
diff --git 

Re: Fixing backslash dot for COPY FROM...CSV

2024-04-02 Thread Tom Lane
"Daniel Verite"  writes:
>   Robert Haas wrote:
>> Those links unfortunately seem not to be entirely specific to this
>> issue. Other, related things seem to be discussed there, and it's not
>> obvious that everyone agrees on what to do, or really that anyone
>> agrees on what to do.

> The proposed  patch addresses these cases by making the sequence
> \. non-special in CSV (in fact backslash itself is a normal character in
> CSV).
> It does not address the cases when the data is embedded after
> the COPY command or typed interactively in psql, since these cases
> require an explicit end-of-data marker, and CSV does not have
> the concept of an end-of-data marker.

I've looked over this patch and I generally agree that this is a
reasonable solution.  While it's barely possible that somebody
out there is depending on the current behavior of \. in CSV mode,
it seems considerably more likely that people who run into it
would consider it a bug.  In any case, the patch isn't proposed
for back-patch, and as cross-version incompatibilities go this
seems like a pretty small one.

I concur with Robert's doubts about some of the doc changes though.
In particular, since we're not changing the behavior for non-CSV
mode, we shouldn't remove any statements that apply to non-CSV mode.

I'm also wondering why the patch adds a test for
"PQprotocolVersion(conn) >= 3" in handleCopyIn.  As already
noted upthread, we ripped out all support for protocol 2.0
some time ago, so that sure looks like dead code.

regards, tom lane




Re: Fixing backslash dot for COPY FROM...CSV

2024-01-24 Thread Daniel Verite
Robert Haas wrote:

> Those links unfortunately seem not to be entirely specific to this
> issue. Other, related things seem to be discussed there, and it's not
> obvious that everyone agrees on what to do, or really that anyone
> agrees on what to do. The best link that I found for this exact issue
> is
> https://www.postgresql.org/message-id/e1tdnvq-0001ju...@wrigleys.postgresql.org
> but the thread isn't very conclusive and is so old that any
> conclusions reached then might no longer be considered valid today.

To refresh the problem statement, 4 cases that need fixing as
of HEAD can be distinguished:

#1. copy csv from file, single column, no quoting involved.
COPY will stop at \. and ignore the rest of the file without
any error or warning.

$ cat >/tmp/file.csv 

Re: Fixing backslash dot for COPY FROM...CSV

2024-01-16 Thread Daniel Verite
Robert Haas wrote:

> Part of my hesitancy, I suppose, is that I don't
> understand why we even have this strange convention of making \.
> terminate the input in the first place -- I mean, why wouldn't that be
> done in some kind of out-of-band way, rather than including a special
> marker in the data?

The v3 protocol added the out-of-band method, but the v2 protocol
did not have it, and as far as I understand, this is the reason why
CopyReadLineText() must interpret \. as an end-of-data marker.

The v2 protocol was removed in pg14
https://www.postgresql.org/docs/release/14.0/

  Remove server and libpq support for the version 2 wire protocol (Heikki
Linnakangas)
  This was last used as the default in PostgreSQL 7.3 (released in 2002).


Also I hadnt' noticed this before, but the current doc has this mention
that is relevant to this patch:

https://www.postgresql.org/docs/current/protocol-changes.html
"Summary of Changes since Protocol 2.0"

  COPY data is now encapsulated into CopyData and CopyDone
  messages. There is a well-defined way to recover from errors during
  COPY. The special “\.” last line is not needed anymore, and is not
  sent during COPY OUT. (It is still recognized as a terminator during
  COPY IN, but its use is deprecated and will eventually be removed.)


What the present patch does is essentially, for the server-side part,
stop recognizing "\." as as terminator, like this paragraph says, but
it does that for CSV only, not for TEXT.

> Hmm. Looking at the rest of the patch, it seems like you're removing
> the logic that prevents us from interpreting
> 
> \. lksdghksdhgjskdghjs
> 
> as an end-of-file while in CSV mode. But I would have thought based on
> what problem you're trying to fix that you would have wanted to keep
> that logic and further restrict it so that it only applies when not
> within a quoted string.
> 
> Maybe I'm misunderstanding what bug you're trying to fix?

The fix is that \. is no longer recognized as special in CSV, whether
alone on a line or not, and whether in a quoted section or not.
It's always interpreted as data, like it would have been in
the first place, I imagine, if the v2 protocol could have handled
it. This is why the patch consists mostly of removing code and
simplifying comments.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Fixing backslash dot for COPY FROM...CSV

2024-01-15 Thread Robert Haas
On Mon, Dec 18, 2023 at 3:36 PM Daniel Verite  wrote:
> PFA a patch that attempts to fix the bug that \. on a line
> by itself is handled incorrectly by COPY FROM ... CSV.
> This issue has been discussed several times previously,
> for instance in [1] and [2], and mentioned in the
> doc for \copy in commit 42d3125.

Those links unfortunately seem not to be entirely specific to this
issue. Other, related things seem to be discussed there, and it's not
obvious that everyone agrees on what to do, or really that anyone
agrees on what to do. The best link that I found for this exact issue
is 
https://www.postgresql.org/message-id/e1tdnvq-0001ju...@wrigleys.postgresql.org
but the thread isn't very conclusive and is so old that any
conclusions reached then might no longer be considered valid today.

And I guess the reason I mention is that, supposing your patch were
technically perfect in every way, would everyone agree that it ought
to be committed? If Alice is a user with a CSV file that might contain
\. on a line by itself within a quoted CSV field, then Alice is
currently sad because she can't necessarily load all of her CSV files.
The patch would fix that, and make her happy. On the other hand, if
Bob is a user with a CSV-ish file that definitely doesn't contain \.
on a line by itself within a quoted CSV field but might have been
truncated in the middle of a quoted field, maybe Bob will be sad if
this patch gets committed, because he will no longer be able to append
\n\.\n to whatever junk he's got in the file and let the server sort
out whether to throw an error.

I have to admit that it seems more likely that there are people in the
world with Alice's problem rather than people in the world with Bob's
problem. We'd probably make more people happy with the change than
sad. But it is a definitional change, I think, and that's a little
scary, and maybe somebody will think that's a reason why we should
change nothing here. Part of my hesitancy, I suppose, is that I don't
understand why we even have this strange convention of making \.
terminate the input in the first place -- I mean, why wouldn't that be
done in some kind of out-of-band way, rather than including a special
marker in the data?

I can't help feeling a bit nervous about this first documentation hunk:

--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -761,11 +761,7 @@ COPY count
 format, \., the end-of-data marker, could also appear
 as a data value.  To avoid any misinterpretation, a \.
 data value appearing as a lone entry on a line is automatically
-quoted on output, and on input, if quoted, is not interpreted as the
-end-of-data marker.  If you are loading a file created by another
-application that has a single unquoted column and might have a
-value of \., you might need to quote that value in the
-input file.
+quoted on output.




It doesn't feel right to me to just replace all of this text with
nothing. That leaves us documenting only the behavior on output,
whereas the previous text documents both the output behavior (we quote
it) and the input behavior (it has to be quoted to avoid being taken
as the EOF marker).

 /*
- * In CSV mode, we only recognize \. alone on a line.  This is because
- * \. is a valid CSV data value.
+ * In CSV mode, backslash is a normal character.
  */
-if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line))
+if (c == '\\' && !cstate->opts.csv_mode)

I don't think that the update comment accurately describes the
behavior. If I understand correctly what you're trying to fix, I'd
expect the new behavior to be that we only recognize \. alone on a
line and even then only if we're not inside a quoting string, but
that's not what the revised comment says. Instead, it claims that
backslash is just a normal character, but if that were true, the whole
if-statement wouldn't exist at all, since its purpose is to provide
special-handling for backslashes -- and indeed the patch does not
change that, since the if-statement is still looking for a backslash
and doing something special if one is found.

Hmm. Looking at the rest of the patch, it seems like you're removing
the logic that prevents us from interpreting

\. lksdghksdhgjskdghjs

as an end-of-file while in CSV mode. But I would have thought based on
what problem you're trying to fix that you would have wanted to keep
that logic and further restrict it so that it only applies when not
within a quoted string.

Maybe I'm misunderstanding what bug you're trying to fix?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Fixing backslash dot for COPY FROM...CSV

2023-12-31 Thread Daniel Verite
  Hi,

The CI patch tester fails on this patch, because it has a label
at the end of a C block, which I'm learning is a C23 feature
that happens to be supported by gcc 11 [1], but is not portable.

PFA an update fixing this, plus removing an obsolete chunk
in the COPY documentation that v2 left out.


[1] https://gcc.gnu.org/onlinedocs/gcc/Mixed-Labels-and-Declarations.html


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From 98f38aff440ad683aa1bd7a30fd960f1d0101191 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Sun, 31 Dec 2023 14:47:05 +0100
Subject: [PATCH v4] Support backslash-dot on a line by itself as valid data in
 COPY FROM...CSV

---
 doc/src/sgml/ref/copy.sgml   |  6 +--
 doc/src/sgml/ref/psql-ref.sgml   |  4 --
 src/backend/commands/copyfromparse.c | 57 +++-
 src/bin/psql/copy.c  | 32 +++-
 src/test/regress/expected/copy.out   | 18 +
 src/test/regress/sql/copy.sql| 12 ++
 6 files changed, 67 insertions(+), 62 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69c33..e719053e3d 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -761,11 +761,7 @@ COPY count
 format, \., the end-of-data marker, could also appear
 as a data value.  To avoid any misinterpretation, a \.
 data value appearing as a lone entry on a line is automatically
-quoted on output, and on input, if quoted, is not interpreted as the
-end-of-data marker.  If you are loading a file created by another
-application that has a single unquoted column and might have a
-value of \., you might need to quote that value in the
-input file.
+quoted on output.

 

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index cc7d797159..d94e3cacfc 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 
'second value' \g
 destination, because all data must pass through the client/server
 connection.  For large amounts of data the SQL
 command might be preferable.
-Also, because of this pass-through method, \copy
-... from in CSV mode will erroneously
-treat a \. data value alone on a line as an
-end-of-input marker.
 
 
 
diff --git a/src/backend/commands/copyfromparse.c 
b/src/backend/commands/copyfromparse.c
index f553734582..67e046d450 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -136,14 +136,6 @@ if (1) \
} \
 } else ((void) 0)
 
-/* Undo any read-ahead and jump out of the block. */
-#define NO_END_OF_COPY_GOTO \
-if (1) \
-{ \
-   input_buf_ptr = prev_raw_ptr + 1; \
-   goto not_end_of_copy; \
-} else ((void) 0)
-
 /* NOTE: there's a copy of this in copyto.c */
 static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0";
 
@@ -1137,7 +1129,6 @@ CopyReadLineText(CopyFromState cstate)
boolresult = false;
 
/* CSV variables */
-   boolfirst_char_in_line = true;
boolin_quote = false,
last_was_esc = false;
charquotec = '\0';
@@ -1335,10 +1326,9 @@ CopyReadLineText(CopyFromState cstate)
}
 
/*
-* In CSV mode, we only recognize \. alone on a line.  This is 
because
-* \. is a valid CSV data value.
+* In CSV mode, backslash is a normal character.
 */
-   if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line))
+   if (c == '\\' && !cstate->opts.csv_mode)
{
charc2;
 
@@ -1371,21 +1361,15 @@ CopyReadLineText(CopyFromState cstate)
 
if (c2 == '\n')
{
-   if (!cstate->opts.csv_mode)
-   ereport(ERROR,
-   
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
-
errmsg("end-of-copy marker does not match previous newline style")));
-   else
-   NO_END_OF_COPY_GOTO;
+   ereport(ERROR,
+   
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+
errmsg("end-of-copy marker does not match previous newline style")));
}
else if 

Re: Fixing backslash dot for COPY FROM...CSV

2023-12-22 Thread vignesh C
On Fri, 22 Dec 2023 at 01:17, Daniel Verite  wrote:
>
> vignesh C wrote:
>
> > Thanks for the updated patch, any reason why this is handled only in csv.
> > postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out';
> > COPY 1
> > postgres=# select * from test1;
> >  c1
> > ---
> > line1
> > (1 row)
>
> I believe it's safer to not change anything to the normal "non-csv"
> text mode.
> The current doc says that \. will not be taken as data in this format.
> From https://www.postgresql.org/docs/current/sql-copy.html :
>
>Any other backslashed character that is not mentioned in the above
>table will be taken to represent itself. However, beware of adding
>backslashes unnecessarily, since that might accidentally produce a
>string matching the end-of-data marker (\.) or the null string (\N
>by default). These strings will be recognized before any other
>backslash processing is done.

Thanks for the clarification. Then let's keep it as you have implemented.

Regards,
Vignesh




Re: Fixing backslash dot for COPY FROM...CSV

2023-12-21 Thread Daniel Verite
vignesh C wrote:

> Thanks for the updated patch, any reason why this is handled only in csv.
> postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out';
> COPY 1
> postgres=# select * from test1;
>  c1
> ---
> line1
> (1 row)

I believe it's safer to not change anything to the normal "non-csv"
text mode.
The current doc says that \. will not be taken as data in this format.
From https://www.postgresql.org/docs/current/sql-copy.html :

   Any other backslashed character that is not mentioned in the above
   table will be taken to represent itself. However, beware of adding
   backslashes unnecessarily, since that might accidentally produce a
   string matching the end-of-data marker (\.) or the null string (\N
   by default). These strings will be recognized before any other
   backslash processing is done.



Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: Fixing backslash dot for COPY FROM...CSV

2023-12-20 Thread vignesh C
On Tue, 19 Dec 2023 at 16:57, Daniel Verite  wrote:
>
> vignesh C wrote:
>
> > I noticed that these tests are passing without applying patch too:
>
> > +insert into copytest2(test) values('line1'), ('\.'), ('line2');
> > +copy (select test from copytest2 order by test collate "C") to :'filename'
> > csv;
> > +-- get the data back in with copy
> > +truncate copytest2;
> > +copy copytest2(test) from :'filename' csv;
> > +select test from copytest2 order by test collate "C";
> >
> > I was not sure if this was intentional. Can we add a test which fails
> > in HEAD and passes with the patch applied.
>
> Thanks for checking this out.
> Indeed, that was not intentional. I've been using static files
> in my tests and forgot that if the data was produced with
> COPY OUT, it would quote backslash-dot so that COPY IN could
> reload it without problem.
>
> PFA an updated version that uses \qecho to produce the
> data instead of COPY OUT. This test on unpatched HEAD
> shows that copytest2 is missing 2 rows after COPY IN.

Thanks for the updated patch, any reason why this is handled only in csv.
postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out';
COPY 1
postgres=# select * from test1;
  c1
---
 line1
(1 row)

postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out' csv;
COPY 1
postgres=# select * from test1;
  c1
---
 line1
 \.
 line2
(3 rows)

As the documentation at [1] says:
An end-of-data marker is not necessary when reading from a file, since
the end of file serves perfectly well; it is needed only when copying
data to or from client applications using pre-3.0 client protocol.

[1] - https://www.postgresql.org/docs/devel/sql-copy.html

Regards,
Vignesh




Re: Fixing backslash dot for COPY FROM...CSV

2023-12-19 Thread Daniel Verite
vignesh C wrote:

> I noticed that these tests are passing without applying patch too:

> +insert into copytest2(test) values('line1'), ('\.'), ('line2');
> +copy (select test from copytest2 order by test collate "C") to :'filename'
> csv;
> +-- get the data back in with copy
> +truncate copytest2;
> +copy copytest2(test) from :'filename' csv;
> +select test from copytest2 order by test collate "C";
> 
> I was not sure if this was intentional. Can we add a test which fails
> in HEAD and passes with the patch applied.

Thanks for checking this out.
Indeed, that was not intentional. I've been using static files
in my tests and forgot that if the data was produced with
COPY OUT, it would quote backslash-dot so that COPY IN could
reload it without problem.

PFA an updated version that uses \qecho to produce the
data instead of COPY OUT. This test on unpatched HEAD
shows that copytest2 is missing 2 rows after COPY IN.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From 0137963134ac88b009a8e93d6a39cabfaefe43f5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Tue, 19 Dec 2023 11:49:35 +0100
Subject: [PATCH v2] Support backslash-dot on a line by itself as valid data in
 COPY FROM...CSV

---
 doc/src/sgml/ref/psql-ref.sgml   |  4 
 src/backend/commands/copyfromparse.c | 13 ++-
 src/bin/psql/copy.c  | 32 
 src/test/regress/expected/copy.out   | 18 
 src/test/regress/sql/copy.sql| 12 +++
 5 files changed, 55 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index cc7d797159..d94e3cacfc 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 
'second value' \g
 destination, because all data must pass through the client/server
 connection.  For large amounts of data the SQL
 command might be preferable.
-Also, because of this pass-through method, \copy
-... from in CSV mode will erroneously
-treat a \. data value alone on a line as an
-end-of-input marker.
 
 
 
diff --git a/src/backend/commands/copyfromparse.c 
b/src/backend/commands/copyfromparse.c
index f553734582..b4c291b25b 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1137,7 +1137,6 @@ CopyReadLineText(CopyFromState cstate)
boolresult = false;
 
/* CSV variables */
-   boolfirst_char_in_line = true;
boolin_quote = false,
last_was_esc = false;
charquotec = '\0';
@@ -1335,10 +1334,9 @@ CopyReadLineText(CopyFromState cstate)
}
 
/*
-* In CSV mode, we only recognize \. alone on a line.  This is 
because
-* \. is a valid CSV data value.
+* In CSV mode, \. is a valid CSV data value anywhere in the 
line.
 */
-   if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line))
+   if (c == '\\' && !cstate->opts.csv_mode)
{
charc2;
 
@@ -1442,14 +1440,7 @@ CopyReadLineText(CopyFromState cstate)
}
}
 
-   /*
-* This label is for CSV cases where \. appears at the start of 
a
-* line, but there is more text after it, meaning it was a data 
value.
-* We are more strict for \. in CSV mode because \. could be a 
data
-* value, while in non-CSV mode, \. cannot be a data value.
-*/
 not_end_of_copy:
-   first_char_in_line = false;
}   /* end of outer 
loop */
 
/*
diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c
index dbbbdb8898..f31a7acbb6 100644
--- a/src/bin/psql/copy.c
+++ b/src/bin/psql/copy.c
@@ -620,20 +620,34 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool 
isbinary, PGresult **res)
/* current line is done? */
if (buf[buflen - 1] == '\n')
{
-   /* check for EOF marker, but not on a 
partial line */
-   if (at_line_begin)
+   /*
+* When at the beginning of the line, 
check for EOF marker.
+* If the marker is found and the data 
is inlined,
+* we must stop at this point.  If not, 
the \. line can be
+* sent to the server, and we let it 
decide whether
+

Re: Fixing backslash dot for COPY FROM...CSV

2023-12-19 Thread vignesh C
On Tue, 19 Dec 2023 at 02:06, Daniel Verite  wrote:
>
>   Hi,
>
> PFA a patch that attempts to fix the bug that \. on a line
> by itself is handled incorrectly by COPY FROM ... CSV.
> This issue has been discussed several times previously,
> for instance in [1] and [2], and mentioned in the
> doc for \copy in commit 42d3125.
>
> There's one case that works today: when
> the line is part of a multi-line quoted section,
> and the data is read from a file, not from the client.
> In other situations, an error is raised or the data is cut at
> the point of \. without an error.
>
> The patch addresses that issue in the server and in psql,
> except for the case of inlined data, where \. cannot be
> both valid data and an EOF marker at the same time, so
> it keeps treating it as an EOF marker.

I noticed that these tests are passing without applying patch too:
+++ b/src/test/regress/sql/copy.sql
@@ -38,6 +38,17 @@ copy copytest2 from :'filename' csv quote  escape E'\\';

 select * from copytest except select * from copytest2;

+--- test unquoted .\ as data inside CSV
+
+truncate copytest2;
+
+insert into copytest2(test) values('line1'), ('\.'), ('line2');
+copy (select test from copytest2 order by test collate "C") to :'filename' csv;
+-- get the data back in with copy
+truncate copytest2;
+copy copytest2(test) from :'filename' csv;
+select test from copytest2 order by test collate "C";

I was not sure if this was intentional. Can we add a test which fails
in HEAD and passes with the patch applied.

Regards,
Vignesh




Fixing backslash dot for COPY FROM...CSV

2023-12-18 Thread Daniel Verite
  Hi,

PFA a patch that attempts to fix the bug that \. on a line
by itself is handled incorrectly by COPY FROM ... CSV.
This issue has been discussed several times previously,
for instance in [1] and [2], and mentioned in the
doc for \copy in commit 42d3125.

There's one case that works today: when
the line is part of a multi-line quoted section,
and the data is read from a file, not from the client.
In other situations, an error is raised or the data is cut at
the point of \. without an error.

The patch addresses that issue in the server and in psql,
except for the case of inlined data, where \. cannot be 
both valid data and an EOF marker at the same time, so
it keeps treating it as an EOF marker. 


[1]
https://www.postgresql.org/message-id/10e3eff6-eb04-4b3f-aeb4-b920192b9...@manitou-mail.org
[2]
https://www.postgresql.org/message-id/8aeab305-5e94-4fa5-82bf-6da6baee6e05%40app.fastmail.com


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
From 868b1e065cf714f315bab65129fd05a1d60fc81b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= 
Date: Mon, 18 Dec 2023 20:47:02 +0100
Subject: [PATCH v1] Support backslash-dot on a line by itself as valid data in
 COPY FROM...CSV

---
 doc/src/sgml/ref/psql-ref.sgml   |  4 
 src/backend/commands/copyfromparse.c | 13 ++-
 src/bin/psql/copy.c  | 32 
 src/test/regress/expected/copy.out   | 15 +
 src/test/regress/sql/copy.sql| 11 ++
 5 files changed, 51 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index cc7d797159..d94e3cacfc 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 
'second value' \g
 destination, because all data must pass through the client/server
 connection.  For large amounts of data the SQL
 command might be preferable.
-Also, because of this pass-through method, \copy
-... from in CSV mode will erroneously
-treat a \. data value alone on a line as an
-end-of-input marker.
 
 
 
diff --git a/src/backend/commands/copyfromparse.c 
b/src/backend/commands/copyfromparse.c
index f553734582..b4c291b25b 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1137,7 +1137,6 @@ CopyReadLineText(CopyFromState cstate)
boolresult = false;
 
/* CSV variables */
-   boolfirst_char_in_line = true;
boolin_quote = false,
last_was_esc = false;
charquotec = '\0';
@@ -1335,10 +1334,9 @@ CopyReadLineText(CopyFromState cstate)
}
 
/*
-* In CSV mode, we only recognize \. alone on a line.  This is 
because
-* \. is a valid CSV data value.
+* In CSV mode, \. is a valid CSV data value anywhere in the 
line.
 */
-   if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line))
+   if (c == '\\' && !cstate->opts.csv_mode)
{
charc2;
 
@@ -1442,14 +1440,7 @@ CopyReadLineText(CopyFromState cstate)
}
}
 
-   /*
-* This label is for CSV cases where \. appears at the start of 
a
-* line, but there is more text after it, meaning it was a data 
value.
-* We are more strict for \. in CSV mode because \. could be a 
data
-* value, while in non-CSV mode, \. cannot be a data value.
-*/
 not_end_of_copy:
-   first_char_in_line = false;
}   /* end of outer 
loop */
 
/*
diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c
index dbbbdb8898..f31a7acbb6 100644
--- a/src/bin/psql/copy.c
+++ b/src/bin/psql/copy.c
@@ -620,20 +620,34 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool 
isbinary, PGresult **res)
/* current line is done? */
if (buf[buflen - 1] == '\n')
{
-   /* check for EOF marker, but not on a 
partial line */
-   if (at_line_begin)
+   /*
+* When at the beginning of the line, 
check for EOF marker.
+* If the marker is found and the data 
is inlined,
+* we must stop at this point.  If not, 
the \. line can be
+* sent to the server, and we let it 
decide whether
+* it's an