Re: Fixing backslash dot for COPY FROM...CSV
Bruce Momjian writes: > On Sun, Apr 7, 2024 at 12:00:25AM +0200, Daniel Verite wrote: >> Agreed we don't want to document that, but also why doesn't \. in the >> contents represent just a dot (as opposed to being an error), >> just like \a is a? > I looked into this and started to realize that \. is the only copy > backslash command where we define the behavior only alone at the > beginning of a line, and not in other circumstances. The \a example > above suggests \. should be period in all other cases, as suggested, but > I don't know the ramifications if that. Here's the problem: if some client-side code thinks it's okay to quote "." as "\.", what is likely to happen when it's presented a data value consisting only of "."? It could very easily fall into the trap of producing an end-of-data marker. If we get rid of the whole concept of end-of-data markers, then it'd be totally reasonable to accept "\." as ".". But as long as we still have end-of-data markers, I think it's unwise to allow "\." to appear as anything but an end-of-data marker. It'd just add camouflage to the booby trap. regards, tom lane
Re: Fixing backslash dot for COPY FROM...CSV
On Sun, Apr 7, 2024 at 12:00:25AM +0200, Daniel Verite wrote: > Tom Lane wrote: > > > This is sufficiently weird that I'm starting to come around to > > Daniel's original proposal that we just drop the server's recognition > > of \. altogether (which would allow removal of some dozens of lines of > > complicated and now known-buggy code) > > FWIW my plan was to not change anything in the TEXT mode, > but I wasn't aware it had this issue that you found when > \. is not in a line by itself. > > > Alternatively, we could fix it so that \. at the end of a line draws > > "end-of-copy marker corrupt" > > which would at least make things consistent, but I'm not sure that has > > any great advantage. I surely don't want to document the current > > behavioral details as being the right thing that we're going to keep > > doing. > > Agreed we don't want to document that, but also why doesn't \. in the > contents represent just a dot (as opposed to being an error), > just like \a is a? I looked into this and started to realize that \. is the only copy backslash command where we define the behavior only alone at the beginning of a line, and not in other circumstances. The \a example above suggests \. should be period in all other cases, as suggested, but I don't know the ramifications if that. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: Fixing backslash dot for COPY FROM...CSV
Tom Lane wrote: > This is sufficiently weird that I'm starting to come around to > Daniel's original proposal that we just drop the server's recognition > of \. altogether (which would allow removal of some dozens of lines of > complicated and now known-buggy code) FWIW my plan was to not change anything in the TEXT mode, but I wasn't aware it had this issue that you found when \. is not in a line by itself. > Alternatively, we could fix it so that \. at the end of a line draws > "end-of-copy marker corrupt" > which would at least make things consistent, but I'm not sure that has > any great advantage. I surely don't want to document the current > behavioral details as being the right thing that we're going to keep > doing. Agreed we don't want to document that, but also why doesn't \. in the contents represent just a dot (as opposed to being an error), just like \a is a? I mean if eofdata contains foobar\a foobaz\aother then we get after import: f1 -- foobara foobazaother (2 rows) Reading the current doc on the text format, I can't see why importing: foobar\. foobar\.other is not supposed to produce f1 -- foobar. foobaz.other (2 rows) I see these rules in [1] about backslash: #1. "End of data can be represented by a single line containing just backslash-period (\.)." foobar\. and foobar\.other do not match that so #1 does not describe how they're interpreted. #2. "Backslash characters (\) can be used in the COPY data to quote data characters that might otherwise be taken as row or column delimiters." Dot is not a column delimiter (it's forbidden anyway), so #2 does not apply. #3. "In particular, the following characters must be preceded by a backslash if they appear as part of a column value: backslash itself, newline, carriage return, and the current delimiter character" Dot is not in that list so #3 does not apply. #4. "The following special backslash sequences are recognized by COPY FROM:" (followed by the table with \b \f, ...,) Dot is not mentioned. #5. "Any other backslashed character that is not mentioned in the above table will be taken to represent itself" Here we say that backslash dot represents a dot (unless other rules apply) foobar\. => foobar. foobar\.other => foobar.other #6. "However, beware of adding backslashes unnecessarily, since that might accidentally produce a string matching the end-of-data marker (\.) or the null string (\N by default)." So we *recommend* not to use \. but as I understand it, the warning with the EOD marker is about accidentally creating a line that matches #1, that is, \. alone on a line. #7 "These strings will be recognized before any other backslash processing is done." TBH I don't understand what #7 implies. The order in backslash processing looks like an implementation detail that should not matter in understanding the format? Considering this, it seems to me that #5 says that backslash-dot represents a dot unless #1 applies, and the other #2 #3 #4 #6 #7 rules do not state anything that would contradict that. [1] https://www.postgresql.org/docs/current/sql-copy.html Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Fixing backslash dot for COPY FROM...CSV
I wrote: > So the current behavior is that \. that is on the end of a line, > but is not the whole line, is silently discarded and we keep going. > All versions throw "end-of-copy marker corrupt" if there is > something after \. on the same line. > This is sufficiently weird that I'm starting to come around to > Daniel's original proposal that we just drop the server's recognition > of \. altogether (which would allow removal of some dozens of lines of > complicated and now known-buggy code). I experimented with that and soon ran into a nasty roadblock: it breaks dump/restore, because pg_dump includes a "\." line after COPY data whether or not it really needs one. Worse, that's implemented by including the "\." line into the archive format, so that existing dump files contain it. Getting rid of it would require an archive format version bump, plus some hackery to allow removal of the line when reading old dump files. While that's surely doable with enough effort, it's not the kind of thing to be undertaking with less than 2 days to feature freeze. Not to mention that I'm not sure we have consensus to do it at all. More fun stuff: PQgetline actually invents a "\." line when it sees server end-of-copy, and we tell users of that function to check for that not an out-of-band return value to detect EOF. It looks like we have no callers of that in the core distro, but do we want to deprecate it completely? So I feel like we need to put this patch on the shelf for the moment and come back to it early in v18. Although it seems reasonably clear what to do on the CSV side of things, it's very much less clear what to do about text-format handling of EOD markers, and I don't want to change one of those things in v17 and the other in v18. Also it seems like there are more dependencies on "\." than we realized. There could be an argument for applying just the psql change now, to remove its unnecessary sending of "\.". That won't break anything and it would give us at least one year's leg up on compatibility issues. regards, tom lane
Re: Fixing backslash dot for COPY FROM...CSV
After some more poking at this topic, I realize that there is already very strange and undocumented behavior for backslash-dot even in non-CSV mode. Create a file like this: $ cat eofdata foobar foobaz\. more \. yet more and try importing it with COPY: regression=# create table eofdata(f1 text); CREATE TABLE regression=# copy eofdata from '/home/tgl/pgsql/eofdata'; COPY 2 regression=# table eofdata; f1 foobar foobaz (2 rows) That's what you get in 9.0 and earlier versions, and it's already not-as-documented, because we claim that only \. alone on a line is an EOF marker; we certainly don't suggest that what's in front of it will be taken as valid data. However, somebody broke it some more in 9.1, because 9.1 up to HEAD produce this result: regression=# create table eofdata(f1 text); CREATE TABLE regression=# copy eofdata from '/home/tgl/pgsql/eofdata'; COPY 3 regression=# table eofdata; f1 foobar foobaz more (3 rows) So the current behavior is that \. that is on the end of a line, but is not the whole line, is silently discarded and we keep going. All versions throw "end-of-copy marker corrupt" if there is something after \. on the same line. This is sufficiently weird that I'm starting to come around to Daniel's original proposal that we just drop the server's recognition of \. altogether (which would allow removal of some dozens of lines of complicated and now known-buggy code). Alternatively, we could fix it so that \. at the end of a line draws "end-of-copy marker corrupt", which would at least make things consistent, but I'm not sure that has any great advantage. I surely don't want to document the current behavioral details as being the right thing that we're going to keep doing. Thoughts? regards, tom lane
Re: Fixing backslash dot for COPY FROM...CSV
"Daniel Verite" writes: > Tom Lane wrote: >> Not sure what to do here. One idea is to install just the psql-side >> fix, which should break nothing now that version-2 protocol is dead, >> and then wait a few years before introducing the server-side change. >> That seems kind of sad though. > Wouldn't backpatching solve this? No, it'd just reduce the surface area a bit. People on less-than- the-latest-minor-release would still have the issue. In any case back-patching further than v14 would be a nonstarter, because we didn't remove protocol v2 support till then. However, the analogy to "\d commands might fail against a newer server" reduces my level of concern quite a lot: it's hard to draw much of a line separating that kind of issue from "inline COPY CSV will fail against a newer server". It's not like such failures won't be obvious and fairly easy to diagnose. regards, tom lane
Re: Fixing backslash dot for COPY FROM...CSV
Tom Lane wrote: > Not sure what to do here. One idea is to install just the psql-side > fix, which should break nothing now that version-2 protocol is dead, > and then wait a few years before introducing the server-side change. > That seems kind of sad though. Wouldn't backpatching solve this? Then only the users who don't apply the minor updates would have non-matching server and psql. Initially I though that obsoleting the v2 protocol was a recent move, but reading older messages from the list I've got the impression that it was more or less in the pipeline since way before version 10. Also one of the cases the patch fixes, the one when imported data are silently truncated at the point of \., is quite nasty IMO. I can imagine an app where user-supplied data would be appended row-by-row into a CSV file, and would be processed periodically by batch. Under some conditions, in particular if newlines in the first column are allowed, a malevolent user could submit a \. sequence to cause the batch to miss the rest of the data without any error being raised. [1] https://www.postgresql.org/message-id/11648.1403147417%40sss.pgh.pa.us Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Fixing backslash dot for COPY FROM...CSV
I wrote: > So this means that the patch introduces a rather serious cross-version > compatibility problem. I doubt we can consider inlined CSV data to be > a niche case that few people use; but it will fail every time if your > psql is older than your server. On third thought, that may not be as bad as I was thinking. We don't blink at the idea that an older psql's \d commands may malfunction with a newer server, and I think most users have internalized the idea that they want psql >= server. If the patch created an incompatibility with that direction, it'd be a problem, but I don't think it does. regards, tom lane
Re: Fixing backslash dot for COPY FROM...CSV
"Daniel Verite" writes: > Tom Lane wrote: >> I've looked over this patch and I generally agree that this is a >> reasonable solution. > Thanks for reviewing this! While testing this, I tried running the tests with an updated server and non-updated psql, and not only did the new test case fail, but so did a bunch of existing ones. That's because existing psql will send the trailing "\." of inlined data to the server, and the updated server will now think that's data if it's in CSV mode. So this means that the patch introduces a rather serious cross-version compatibility problem. I doubt we can consider inlined CSV data to be a niche case that few people use; but it will fail every time if your psql is older than your server. Not sure what to do here. One idea is to install just the psql-side fix, which should break nothing now that version-2 protocol is dead, and then wait a few years before introducing the server-side change. That seems kind of sad though. An argument for not waiting is that psql may not be the only client that needs this behavioral adjustment, and if so there's going to be breakage anyway when we change the server; so we might as well get it over with. regards, tom lane
Re: Fixing backslash dot for COPY FROM...CSV
Tom Lane wrote: > I've looked over this patch and I generally agree that this is a > reasonable solution. Thanks for reviewing this! > I'm also wondering why the patch adds a test for > "PQprotocolVersion(conn) >= 3" in handleCopyIn. I've removed this in the attached update. > I concur with Robert's doubts about some of the doc changes though. > In particular, since we're not changing the behavior for non-CSV > mode, we shouldn't remove any statements that apply to non-CSV mode. I don't quite understand the issues with the doc changes. Let me detail the changes. The first change is under CSV Format so it does no concern non-CSV modes. --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -761,11 +761,7 @@ COPY count format, \., the end-of-data marker, could also appear as a data value. To avoid any misinterpretation, a \. data value appearing as a lone entry on a line is automatically -quoted on output, and on input, if quoted, is not interpreted as the -end-of-data marker. If you are loading a file created by another -application that has a single unquoted column and might have a -value of \., you might need to quote that value in the -input file. +quoted on output. The part about quoting output is kept because the code still does that. About this bit: "and on input, if quoted, is not interpreted as the end-of-data marker." So the patch removes that part. The reason is \. is now not interpreted as EOD in both cases, quoted or unquoted, conforming to spec. Previously, what the reader should have understood by "if quoted, ..." is that it implies "if not quoted, then .\ will be interpreted as EOD even though that behavior does not conform to the CSV spec". If we documented the new behavior, it would be something like: when quoted, it works as expected, and when unquoted, it works as expected too. Isn't it simpler not to say anything? About the next sentence: "If you are loading a file created by another application that has a single unquoted column and might have a value of \., you might need to quote that value in the input file." It's removed as well because it's not longer necessary to do that. The other hunk is in psql doc: --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g destination, because all data must pass through the client/server connection. For large amounts of data the SQL command might be preferable. -Also, because of this pass-through method, \copy -... from in CSV mode will erroneously -treat a \. data value alone on a line as an -end-of-input marker. That behavior no longer happens, so this gets removed as well. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite From 5a16b81f10e47fe1ac3842abd34e8bef7e639e26 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= Date: Fri, 5 Apr 2024 14:22:49 +0200 Subject: [PATCH v4] Support backslash-dot on a line by itself as valid data in COPY FROM...CSV --- doc/src/sgml/ref/copy.sgml | 6 +-- doc/src/sgml/ref/psql-ref.sgml | 4 -- src/backend/commands/copyfromparse.c | 57 +++- src/bin/psql/copy.c | 30 ++- src/test/regress/expected/copy.out | 18 + src/test/regress/sql/copy.sql| 12 ++ 6 files changed, 65 insertions(+), 62 deletions(-) diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index 33ce7c4ea6..a63f073c1b 100644 --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -814,11 +814,7 @@ COPY count format, \., the end-of-data marker, could also appear as a data value. To avoid any misinterpretation, a \. data value appearing as a lone entry on a line is automatically -quoted on output, and on input, if quoted, is not interpreted as the -end-of-data marker. If you are loading a file created by another -application that has a single unquoted column and might have a -value of \., you might need to quote that value in the -input file. +quoted on output. diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index cc7d797159..d94e3cacfc 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g destination, because all data must pass through the client/server connection. For large amounts of data the SQL command might be preferable. -Also, because of this pass-through method, \copy -... from in CSV mode will erroneously -treat a \. data value alone on a line as an -end-of-input marker. diff --git
Re: Fixing backslash dot for COPY FROM...CSV
"Daniel Verite" writes: > Robert Haas wrote: >> Those links unfortunately seem not to be entirely specific to this >> issue. Other, related things seem to be discussed there, and it's not >> obvious that everyone agrees on what to do, or really that anyone >> agrees on what to do. > The proposed patch addresses these cases by making the sequence > \. non-special in CSV (in fact backslash itself is a normal character in > CSV). > It does not address the cases when the data is embedded after > the COPY command or typed interactively in psql, since these cases > require an explicit end-of-data marker, and CSV does not have > the concept of an end-of-data marker. I've looked over this patch and I generally agree that this is a reasonable solution. While it's barely possible that somebody out there is depending on the current behavior of \. in CSV mode, it seems considerably more likely that people who run into it would consider it a bug. In any case, the patch isn't proposed for back-patch, and as cross-version incompatibilities go this seems like a pretty small one. I concur with Robert's doubts about some of the doc changes though. In particular, since we're not changing the behavior for non-CSV mode, we shouldn't remove any statements that apply to non-CSV mode. I'm also wondering why the patch adds a test for "PQprotocolVersion(conn) >= 3" in handleCopyIn. As already noted upthread, we ripped out all support for protocol 2.0 some time ago, so that sure looks like dead code. regards, tom lane
Re: Fixing backslash dot for COPY FROM...CSV
Robert Haas wrote: > Those links unfortunately seem not to be entirely specific to this > issue. Other, related things seem to be discussed there, and it's not > obvious that everyone agrees on what to do, or really that anyone > agrees on what to do. The best link that I found for this exact issue > is > https://www.postgresql.org/message-id/e1tdnvq-0001ju...@wrigleys.postgresql.org > but the thread isn't very conclusive and is so old that any > conclusions reached then might no longer be considered valid today. To refresh the problem statement, 4 cases that need fixing as of HEAD can be distinguished: #1. copy csv from file, single column, no quoting involved. COPY will stop at \. and ignore the rest of the file without any error or warning. $ cat >/tmp/file.csv
Re: Fixing backslash dot for COPY FROM...CSV
Robert Haas wrote: > Part of my hesitancy, I suppose, is that I don't > understand why we even have this strange convention of making \. > terminate the input in the first place -- I mean, why wouldn't that be > done in some kind of out-of-band way, rather than including a special > marker in the data? The v3 protocol added the out-of-band method, but the v2 protocol did not have it, and as far as I understand, this is the reason why CopyReadLineText() must interpret \. as an end-of-data marker. The v2 protocol was removed in pg14 https://www.postgresql.org/docs/release/14.0/ Remove server and libpq support for the version 2 wire protocol (Heikki Linnakangas) This was last used as the default in PostgreSQL 7.3 (released in 2002). Also I hadnt' noticed this before, but the current doc has this mention that is relevant to this patch: https://www.postgresql.org/docs/current/protocol-changes.html "Summary of Changes since Protocol 2.0" COPY data is now encapsulated into CopyData and CopyDone messages. There is a well-defined way to recover from errors during COPY. The special “\.” last line is not needed anymore, and is not sent during COPY OUT. (It is still recognized as a terminator during COPY IN, but its use is deprecated and will eventually be removed.) What the present patch does is essentially, for the server-side part, stop recognizing "\." as as terminator, like this paragraph says, but it does that for CSV only, not for TEXT. > Hmm. Looking at the rest of the patch, it seems like you're removing > the logic that prevents us from interpreting > > \. lksdghksdhgjskdghjs > > as an end-of-file while in CSV mode. But I would have thought based on > what problem you're trying to fix that you would have wanted to keep > that logic and further restrict it so that it only applies when not > within a quoted string. > > Maybe I'm misunderstanding what bug you're trying to fix? The fix is that \. is no longer recognized as special in CSV, whether alone on a line or not, and whether in a quoted section or not. It's always interpreted as data, like it would have been in the first place, I imagine, if the v2 protocol could have handled it. This is why the patch consists mostly of removing code and simplifying comments. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Fixing backslash dot for COPY FROM...CSV
On Mon, Dec 18, 2023 at 3:36 PM Daniel Verite wrote: > PFA a patch that attempts to fix the bug that \. on a line > by itself is handled incorrectly by COPY FROM ... CSV. > This issue has been discussed several times previously, > for instance in [1] and [2], and mentioned in the > doc for \copy in commit 42d3125. Those links unfortunately seem not to be entirely specific to this issue. Other, related things seem to be discussed there, and it's not obvious that everyone agrees on what to do, or really that anyone agrees on what to do. The best link that I found for this exact issue is https://www.postgresql.org/message-id/e1tdnvq-0001ju...@wrigleys.postgresql.org but the thread isn't very conclusive and is so old that any conclusions reached then might no longer be considered valid today. And I guess the reason I mention is that, supposing your patch were technically perfect in every way, would everyone agree that it ought to be committed? If Alice is a user with a CSV file that might contain \. on a line by itself within a quoted CSV field, then Alice is currently sad because she can't necessarily load all of her CSV files. The patch would fix that, and make her happy. On the other hand, if Bob is a user with a CSV-ish file that definitely doesn't contain \. on a line by itself within a quoted CSV field but might have been truncated in the middle of a quoted field, maybe Bob will be sad if this patch gets committed, because he will no longer be able to append \n\.\n to whatever junk he's got in the file and let the server sort out whether to throw an error. I have to admit that it seems more likely that there are people in the world with Alice's problem rather than people in the world with Bob's problem. We'd probably make more people happy with the change than sad. But it is a definitional change, I think, and that's a little scary, and maybe somebody will think that's a reason why we should change nothing here. Part of my hesitancy, I suppose, is that I don't understand why we even have this strange convention of making \. terminate the input in the first place -- I mean, why wouldn't that be done in some kind of out-of-band way, rather than including a special marker in the data? I can't help feeling a bit nervous about this first documentation hunk: --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -761,11 +761,7 @@ COPY count format, \., the end-of-data marker, could also appear as a data value. To avoid any misinterpretation, a \. data value appearing as a lone entry on a line is automatically -quoted on output, and on input, if quoted, is not interpreted as the -end-of-data marker. If you are loading a file created by another -application that has a single unquoted column and might have a -value of \., you might need to quote that value in the -input file. +quoted on output. It doesn't feel right to me to just replace all of this text with nothing. That leaves us documenting only the behavior on output, whereas the previous text documents both the output behavior (we quote it) and the input behavior (it has to be quoted to avoid being taken as the EOF marker). /* - * In CSV mode, we only recognize \. alone on a line. This is because - * \. is a valid CSV data value. + * In CSV mode, backslash is a normal character. */ -if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line)) +if (c == '\\' && !cstate->opts.csv_mode) I don't think that the update comment accurately describes the behavior. If I understand correctly what you're trying to fix, I'd expect the new behavior to be that we only recognize \. alone on a line and even then only if we're not inside a quoting string, but that's not what the revised comment says. Instead, it claims that backslash is just a normal character, but if that were true, the whole if-statement wouldn't exist at all, since its purpose is to provide special-handling for backslashes -- and indeed the patch does not change that, since the if-statement is still looking for a backslash and doing something special if one is found. Hmm. Looking at the rest of the patch, it seems like you're removing the logic that prevents us from interpreting \. lksdghksdhgjskdghjs as an end-of-file while in CSV mode. But I would have thought based on what problem you're trying to fix that you would have wanted to keep that logic and further restrict it so that it only applies when not within a quoted string. Maybe I'm misunderstanding what bug you're trying to fix? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Fixing backslash dot for COPY FROM...CSV
Hi, The CI patch tester fails on this patch, because it has a label at the end of a C block, which I'm learning is a C23 feature that happens to be supported by gcc 11 [1], but is not portable. PFA an update fixing this, plus removing an obsolete chunk in the COPY documentation that v2 left out. [1] https://gcc.gnu.org/onlinedocs/gcc/Mixed-Labels-and-Declarations.html Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite From 98f38aff440ad683aa1bd7a30fd960f1d0101191 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= Date: Sun, 31 Dec 2023 14:47:05 +0100 Subject: [PATCH v4] Support backslash-dot on a line by itself as valid data in COPY FROM...CSV --- doc/src/sgml/ref/copy.sgml | 6 +-- doc/src/sgml/ref/psql-ref.sgml | 4 -- src/backend/commands/copyfromparse.c | 57 +++- src/bin/psql/copy.c | 32 +++- src/test/regress/expected/copy.out | 18 + src/test/regress/sql/copy.sql| 12 ++ 6 files changed, 67 insertions(+), 62 deletions(-) diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index 18ecc69c33..e719053e3d 100644 --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -761,11 +761,7 @@ COPY count format, \., the end-of-data marker, could also appear as a data value. To avoid any misinterpretation, a \. data value appearing as a lone entry on a line is automatically -quoted on output, and on input, if quoted, is not interpreted as the -end-of-data marker. If you are loading a file created by another -application that has a single unquoted column and might have a -value of \., you might need to quote that value in the -input file. +quoted on output. diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index cc7d797159..d94e3cacfc 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g destination, because all data must pass through the client/server connection. For large amounts of data the SQL command might be preferable. -Also, because of this pass-through method, \copy -... from in CSV mode will erroneously -treat a \. data value alone on a line as an -end-of-input marker. diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index f553734582..67e046d450 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -136,14 +136,6 @@ if (1) \ } \ } else ((void) 0) -/* Undo any read-ahead and jump out of the block. */ -#define NO_END_OF_COPY_GOTO \ -if (1) \ -{ \ - input_buf_ptr = prev_raw_ptr + 1; \ - goto not_end_of_copy; \ -} else ((void) 0) - /* NOTE: there's a copy of this in copyto.c */ static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0"; @@ -1137,7 +1129,6 @@ CopyReadLineText(CopyFromState cstate) boolresult = false; /* CSV variables */ - boolfirst_char_in_line = true; boolin_quote = false, last_was_esc = false; charquotec = '\0'; @@ -1335,10 +1326,9 @@ CopyReadLineText(CopyFromState cstate) } /* -* In CSV mode, we only recognize \. alone on a line. This is because -* \. is a valid CSV data value. +* In CSV mode, backslash is a normal character. */ - if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line)) + if (c == '\\' && !cstate->opts.csv_mode) { charc2; @@ -1371,21 +1361,15 @@ CopyReadLineText(CopyFromState cstate) if (c2 == '\n') { - if (!cstate->opts.csv_mode) - ereport(ERROR, - (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), - errmsg("end-of-copy marker does not match previous newline style"))); - else - NO_END_OF_COPY_GOTO; + ereport(ERROR, + (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), + errmsg("end-of-copy marker does not match previous newline style"))); } else if
Re: Fixing backslash dot for COPY FROM...CSV
On Fri, 22 Dec 2023 at 01:17, Daniel Verite wrote: > > vignesh C wrote: > > > Thanks for the updated patch, any reason why this is handled only in csv. > > postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out'; > > COPY 1 > > postgres=# select * from test1; > > c1 > > --- > > line1 > > (1 row) > > I believe it's safer to not change anything to the normal "non-csv" > text mode. > The current doc says that \. will not be taken as data in this format. > From https://www.postgresql.org/docs/current/sql-copy.html : > >Any other backslashed character that is not mentioned in the above >table will be taken to represent itself. However, beware of adding >backslashes unnecessarily, since that might accidentally produce a >string matching the end-of-data marker (\.) or the null string (\N >by default). These strings will be recognized before any other >backslash processing is done. Thanks for the clarification. Then let's keep it as you have implemented. Regards, Vignesh
Re: Fixing backslash dot for COPY FROM...CSV
vignesh C wrote: > Thanks for the updated patch, any reason why this is handled only in csv. > postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out'; > COPY 1 > postgres=# select * from test1; > c1 > --- > line1 > (1 row) I believe it's safer to not change anything to the normal "non-csv" text mode. The current doc says that \. will not be taken as data in this format. From https://www.postgresql.org/docs/current/sql-copy.html : Any other backslashed character that is not mentioned in the above table will be taken to represent itself. However, beware of adding backslashes unnecessarily, since that might accidentally produce a string matching the end-of-data marker (\.) or the null string (\N by default). These strings will be recognized before any other backslash processing is done. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: Fixing backslash dot for COPY FROM...CSV
On Tue, 19 Dec 2023 at 16:57, Daniel Verite wrote: > > vignesh C wrote: > > > I noticed that these tests are passing without applying patch too: > > > +insert into copytest2(test) values('line1'), ('\.'), ('line2'); > > +copy (select test from copytest2 order by test collate "C") to :'filename' > > csv; > > +-- get the data back in with copy > > +truncate copytest2; > > +copy copytest2(test) from :'filename' csv; > > +select test from copytest2 order by test collate "C"; > > > > I was not sure if this was intentional. Can we add a test which fails > > in HEAD and passes with the patch applied. > > Thanks for checking this out. > Indeed, that was not intentional. I've been using static files > in my tests and forgot that if the data was produced with > COPY OUT, it would quote backslash-dot so that COPY IN could > reload it without problem. > > PFA an updated version that uses \qecho to produce the > data instead of COPY OUT. This test on unpatched HEAD > shows that copytest2 is missing 2 rows after COPY IN. Thanks for the updated patch, any reason why this is handled only in csv. postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out'; COPY 1 postgres=# select * from test1; c1 --- line1 (1 row) postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out' csv; COPY 1 postgres=# select * from test1; c1 --- line1 \. line2 (3 rows) As the documentation at [1] says: An end-of-data marker is not necessary when reading from a file, since the end of file serves perfectly well; it is needed only when copying data to or from client applications using pre-3.0 client protocol. [1] - https://www.postgresql.org/docs/devel/sql-copy.html Regards, Vignesh
Re: Fixing backslash dot for COPY FROM...CSV
vignesh C wrote: > I noticed that these tests are passing without applying patch too: > +insert into copytest2(test) values('line1'), ('\.'), ('line2'); > +copy (select test from copytest2 order by test collate "C") to :'filename' > csv; > +-- get the data back in with copy > +truncate copytest2; > +copy copytest2(test) from :'filename' csv; > +select test from copytest2 order by test collate "C"; > > I was not sure if this was intentional. Can we add a test which fails > in HEAD and passes with the patch applied. Thanks for checking this out. Indeed, that was not intentional. I've been using static files in my tests and forgot that if the data was produced with COPY OUT, it would quote backslash-dot so that COPY IN could reload it without problem. PFA an updated version that uses \qecho to produce the data instead of COPY OUT. This test on unpatched HEAD shows that copytest2 is missing 2 rows after COPY IN. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite From 0137963134ac88b009a8e93d6a39cabfaefe43f5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= Date: Tue, 19 Dec 2023 11:49:35 +0100 Subject: [PATCH v2] Support backslash-dot on a line by itself as valid data in COPY FROM...CSV --- doc/src/sgml/ref/psql-ref.sgml | 4 src/backend/commands/copyfromparse.c | 13 ++- src/bin/psql/copy.c | 32 src/test/regress/expected/copy.out | 18 src/test/regress/sql/copy.sql| 12 +++ 5 files changed, 55 insertions(+), 24 deletions(-) diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index cc7d797159..d94e3cacfc 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g destination, because all data must pass through the client/server connection. For large amounts of data the SQL command might be preferable. -Also, because of this pass-through method, \copy -... from in CSV mode will erroneously -treat a \. data value alone on a line as an -end-of-input marker. diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index f553734582..b4c291b25b 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -1137,7 +1137,6 @@ CopyReadLineText(CopyFromState cstate) boolresult = false; /* CSV variables */ - boolfirst_char_in_line = true; boolin_quote = false, last_was_esc = false; charquotec = '\0'; @@ -1335,10 +1334,9 @@ CopyReadLineText(CopyFromState cstate) } /* -* In CSV mode, we only recognize \. alone on a line. This is because -* \. is a valid CSV data value. +* In CSV mode, \. is a valid CSV data value anywhere in the line. */ - if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line)) + if (c == '\\' && !cstate->opts.csv_mode) { charc2; @@ -1442,14 +1440,7 @@ CopyReadLineText(CopyFromState cstate) } } - /* -* This label is for CSV cases where \. appears at the start of a -* line, but there is more text after it, meaning it was a data value. -* We are more strict for \. in CSV mode because \. could be a data -* value, while in non-CSV mode, \. cannot be a data value. -*/ not_end_of_copy: - first_char_in_line = false; } /* end of outer loop */ /* diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c index dbbbdb8898..f31a7acbb6 100644 --- a/src/bin/psql/copy.c +++ b/src/bin/psql/copy.c @@ -620,20 +620,34 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res) /* current line is done? */ if (buf[buflen - 1] == '\n') { - /* check for EOF marker, but not on a partial line */ - if (at_line_begin) + /* +* When at the beginning of the line, check for EOF marker. +* If the marker is found and the data is inlined, +* we must stop at this point. If not, the \. line can be +* sent to the server, and we let it decide whether +
Re: Fixing backslash dot for COPY FROM...CSV
On Tue, 19 Dec 2023 at 02:06, Daniel Verite wrote: > > Hi, > > PFA a patch that attempts to fix the bug that \. on a line > by itself is handled incorrectly by COPY FROM ... CSV. > This issue has been discussed several times previously, > for instance in [1] and [2], and mentioned in the > doc for \copy in commit 42d3125. > > There's one case that works today: when > the line is part of a multi-line quoted section, > and the data is read from a file, not from the client. > In other situations, an error is raised or the data is cut at > the point of \. without an error. > > The patch addresses that issue in the server and in psql, > except for the case of inlined data, where \. cannot be > both valid data and an EOF marker at the same time, so > it keeps treating it as an EOF marker. I noticed that these tests are passing without applying patch too: +++ b/src/test/regress/sql/copy.sql @@ -38,6 +38,17 @@ copy copytest2 from :'filename' csv quote escape E'\\'; select * from copytest except select * from copytest2; +--- test unquoted .\ as data inside CSV + +truncate copytest2; + +insert into copytest2(test) values('line1'), ('\.'), ('line2'); +copy (select test from copytest2 order by test collate "C") to :'filename' csv; +-- get the data back in with copy +truncate copytest2; +copy copytest2(test) from :'filename' csv; +select test from copytest2 order by test collate "C"; I was not sure if this was intentional. Can we add a test which fails in HEAD and passes with the patch applied. Regards, Vignesh
Fixing backslash dot for COPY FROM...CSV
Hi, PFA a patch that attempts to fix the bug that \. on a line by itself is handled incorrectly by COPY FROM ... CSV. This issue has been discussed several times previously, for instance in [1] and [2], and mentioned in the doc for \copy in commit 42d3125. There's one case that works today: when the line is part of a multi-line quoted section, and the data is read from a file, not from the client. In other situations, an error is raised or the data is cut at the point of \. without an error. The patch addresses that issue in the server and in psql, except for the case of inlined data, where \. cannot be both valid data and an EOF marker at the same time, so it keeps treating it as an EOF marker. [1] https://www.postgresql.org/message-id/10e3eff6-eb04-4b3f-aeb4-b920192b9...@manitou-mail.org [2] https://www.postgresql.org/message-id/8aeab305-5e94-4fa5-82bf-6da6baee6e05%40app.fastmail.com Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite From 868b1e065cf714f315bab65129fd05a1d60fc81b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Daniel=20V=C3=A9rit=C3=A9?= Date: Mon, 18 Dec 2023 20:47:02 +0100 Subject: [PATCH v1] Support backslash-dot on a line by itself as valid data in COPY FROM...CSV --- doc/src/sgml/ref/psql-ref.sgml | 4 src/backend/commands/copyfromparse.c | 13 ++- src/bin/psql/copy.c | 32 src/test/regress/expected/copy.out | 15 + src/test/regress/sql/copy.sql| 11 ++ 5 files changed, 51 insertions(+), 24 deletions(-) diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index cc7d797159..d94e3cacfc 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g destination, because all data must pass through the client/server connection. For large amounts of data the SQL command might be preferable. -Also, because of this pass-through method, \copy -... from in CSV mode will erroneously -treat a \. data value alone on a line as an -end-of-input marker. diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index f553734582..b4c291b25b 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -1137,7 +1137,6 @@ CopyReadLineText(CopyFromState cstate) boolresult = false; /* CSV variables */ - boolfirst_char_in_line = true; boolin_quote = false, last_was_esc = false; charquotec = '\0'; @@ -1335,10 +1334,9 @@ CopyReadLineText(CopyFromState cstate) } /* -* In CSV mode, we only recognize \. alone on a line. This is because -* \. is a valid CSV data value. +* In CSV mode, \. is a valid CSV data value anywhere in the line. */ - if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line)) + if (c == '\\' && !cstate->opts.csv_mode) { charc2; @@ -1442,14 +1440,7 @@ CopyReadLineText(CopyFromState cstate) } } - /* -* This label is for CSV cases where \. appears at the start of a -* line, but there is more text after it, meaning it was a data value. -* We are more strict for \. in CSV mode because \. could be a data -* value, while in non-CSV mode, \. cannot be a data value. -*/ not_end_of_copy: - first_char_in_line = false; } /* end of outer loop */ /* diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c index dbbbdb8898..f31a7acbb6 100644 --- a/src/bin/psql/copy.c +++ b/src/bin/psql/copy.c @@ -620,20 +620,34 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res) /* current line is done? */ if (buf[buflen - 1] == '\n') { - /* check for EOF marker, but not on a partial line */ - if (at_line_begin) + /* +* When at the beginning of the line, check for EOF marker. +* If the marker is found and the data is inlined, +* we must stop at this point. If not, the \. line can be +* sent to the server, and we let it decide whether +* it's an