Re: confusing / inefficient "need_transcoding" handling in copy
Hi, In "Re: confusing / inefficient "need_transcoding" handling in copy" on Mon, 16 Dec 2024 11:25:16 +0900, Michael Paquier wrote: > On Sat, Dec 14, 2024 at 04:46:57PM +0900, Michael Paquier wrote: >> Note that using "test" as table name for the tests is not a good idea, >> as this could very likely conflict with some concurrent activity. I >> would also add two RESET queries to remove the dependency to >> client_encoding once the test has no need to rely on it. No need to >> send a new patch for all that, just noticing in passing. > > I got some time to look again at this one. Applied with the tweaks > for the table name and the two RESET queries. Thanks! I'll use better table name next time. Thanks, -- kou
Re: confusing / inefficient "need_transcoding" handling in copy
On Sat, Dec 14, 2024 at 04:46:57PM +0900, Michael Paquier wrote: > Note that using "test" as table name for the tests is not a good idea, > as this could very likely conflict with some concurrent activity. I > would also add two RESET queries to remove the dependency to > client_encoding once the test has no need to rely on it. No need to > send a new patch for all that, just noticing in passing. I got some time to look again at this one. Applied with the tweaks for the table name and the two RESET queries. -- Michael signature.asc Description: PGP signature
Re: confusing / inefficient "need_transcoding" handling in copy
On Fri, Dec 13, 2024 at 12:27:37PM +0900, Sutou Kouhei wrote: > Oh, sorry... I attached wrong patch... > I attach the v4 patch that includes this case. Sounds fair to me as a beginning for the code paths setting need_transcoding. Note that using "test" as table name for the tests is not a good idea, as this could very likely conflict with some concurrent activity. I would also add two RESET queries to remove the dependency to client_encoding once the test has no need to rely on it. No need to send a new patch for all that, just noticing in passing. -- Michael signature.asc Description: PGP signature
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, In "Re: confusing / inefficient "need_transcoding" handling in copy" on Fri, 13 Dec 2024 12:03:45 +0900, Michael Paquier wrote: >> OK. I've added valid cases too by using LATIN1 as you >> suggested. > > I may have missed something but v3 does not use that for a valid > conversion? Oh, sorry... I attached wrong patch... I attach the v4 patch that includes this case. >> Oh! I didn't know the "XXX_1.out" feature. > > You have missed the inclusion of an alternate output, which should be > something like that to bypass the test rather than failing: > --- /dev/null > +++ b/src/test/regress/expected/copyencoding_1.out > @@ -0,0 +1,7 @@ > +-- > +-- Test cases for COPY encoding > +-- > +-- skip test if not UTF8 server encoding > +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset > +\if :skip_test > +\quit > > I guess that you have the file, forgot a `git add`. I did "git add" but I attached a wrong file... The v4 patch includes this too. Thanks, -- kou >From 65976f2357da73ea0d821b3b071f479a1650169f Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Wed, 14 Feb 2024 11:44:13 +0900 Subject: [PATCH v4] Add tests for encoding for COPY FROM This adds valid cases and invalid cases. Valid cases read UTF8 character as LATIN1 character. Because LATIN1 accepts any bytes. Invalid cases read UTF8 character as EUC_JP character. The added tests use ENCODING and client_encoding. The client_encoding value is used when we don't specify ENCODING explicitly. --- src/test/regress/expected/copyencoding.out | 43 src/test/regress/expected/copyencoding_1.out | 7 +++ src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql| 52 4 files changed, 103 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/expected/copyencoding_1.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 000..d7ee888997c --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,43 @@ +-- +-- Test cases for COPY encoding +-- +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit +\endif +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR +CREATE TABLE test (t text); +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- Valid cases +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +-- Read UTF8 data as LATIN1: No error +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'LATIN1'); +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +-- Read UTF8 data as LATIN1: No error +SET client_encoding TO LATIN1; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); +-- Invalid cases +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +-- Read UTF8 data as EUC_JP: No error +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +-- Read UTF8 data as EUC_JP: No error +SET client_encoding TO EUC_JP; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/expected/copyencoding_1.out b/src/test/regress/expected/copyencoding_1.out new file mode 100644 index 000..c754e23ff7d --- /dev/null +++ b/src/test/regress/expected/copyencoding_1.out @@ -0,0 +1,7 @@ +-- +-- Test cases for COPY encoding +-- +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 81e4222d26a..1edd9e45ebb 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # -- -t
Re: confusing / inefficient "need_transcoding" handling in copy
On Thu, Dec 12, 2024 at 03:25:41PM +0900, Sutou Kouhei wrote: > In > "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, > 10 Dec 2024 13:59:25 +0900, > Michael Paquier wrote: >> Another one would be valid conversions back and forth. For example, >> I recall that LATIN1 accepts any bytes and can apply a conversion to >> UTF-8, so we could use it and expand a bit more the proposed tests? >> Or something like that? > > OK. I've added valid cases too by using LATIN1 as you > suggested. I may have missed something but v3 does not use that for a valid conversion? >> Likely, this should be made conditional, based on the fact that the >> database needs to be able to support utf8? There are a couple of >> examples like that in the tree, based on the following SQL trick: >> SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset >> \if :skip_test >> \quit >> \endif > > Thanks. I didn't notice the portability problem. I've added > the skip trick. Indeed, thanks. > Oh! I didn't know the "XXX_1.out" feature. You have missed the inclusion of an alternate output, which should be something like that to bypass the test rather than failing: --- /dev/null +++ b/src/test/regress/expected/copyencoding_1.out @@ -0,0 +1,7 @@ +-- +-- Test cases for COPY encoding +-- +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit I guess that you have the file, forgot a `git add`. -- Michael signature.asc Description: PGP signature
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, In "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 10 Dec 2024 13:59:25 +0900, Michael Paquier wrote: > client_encoding would be used by COPY when not specifying ENCODING > option. Perhaps more tests should be added with this value specified > by a SET client_encoding? It makes sense. I missed the case. I've added the case to the v3 patch. > Another one would be valid conversions back and forth. For example, > I recall that LATIN1 accepts any bytes and can apply a conversion to > UTF-8, so we could use it and expand a bit more the proposed tests? > Or something like that? OK. I've added valid cases too by using LATIN1 as you suggested. > This is not going to be portable across the buildfarm. Two reasons > are spotted by the CI (there may be others): > 1) For Windows, as in the following regression.diffs: > COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); > +ERROR: character with byte sequence 0xe3 0x81 0x82 in encoding "UTF8" has > no equivalent in encoding "WIN1252" > 2) Second failure on Linux, with 32-bit builds: > COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); > +ERROR: conversion between UTF8 and SQL_ASCII is not supported > > Likely, this should be made conditional, based on the fact that the > database needs to be able to support utf8? There are a couple of > examples like that in the tree, based on the following SQL trick: > SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset > \if :skip_test > \quit > \endif Thanks. I didn't notice the portability problem. I've added the skip trick. > This requires an alternate output for the non-utf8 case. Oh! I didn't know the "XXX_1.out" feature. Thanks, -- kou >From 104c21bbd8d4fe063b74510d16be32372c4ac1bc Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Wed, 14 Feb 2024 11:44:13 +0900 Subject: [PATCH v3] Add tests for invalid encoding for COPY FROM The test data use an UTF-8 character (U+3042 HIRAGANA LETTER A) but the test specifies EUC_JP. So it's an invalid data. The added tests use ENCODING and client_encoding. The client_encoding value is used when we don't specify ENCODING explicitly. --- src/test/regress/expected/copyencoding.out | 27 +++ src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 30 ++ 3 files changed, 58 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 000..86fcb640313 --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,27 @@ +-- +-- Test cases for COPY encoding +-- +-- skip test if not UTF8 server encoding +SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset +\if :skip_test +\quit +\endif +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR +CREATE TABLE test (t text); +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- Use ENCODING explicitly +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +-- Use client_encoding +SET client_encoding TO UTF8; +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv); +SET client_encoding TO EUC_JP; +COPY test FROM :'utf8_csv' WITH (FORMAT csv); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 81e4222d26a..1edd9e45ebb 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # -- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # -- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 000..46a467cc31e --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,30 @@ +-- +-- Test cases for COPY encoding +-- + +--
Re: confusing / inefficient "need_transcoding" handling in copy
On Fri, Dec 06, 2024 at 04:20:42PM +0900, Sutou Kouhei wrote: > (Do you think that this patch is still needed?) This thread has fallen off my radar, my apologies about that. Yes, I think that this is a good thing to expand these tests. Let's take one step at a time. I have a couple of comments. +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; client_encoding would be used by COPY when not specifying ENCODING option. Perhaps more tests should be added with this value specified by a SET client_encoding? Another one would be valid conversions back and forth. For example, I recall that LATIN1 accepts any bytes and can apply a conversion to UTF-8, so we could use it and expand a bit more the proposed tests? Or something like that? This is not going to be portable across the buildfarm. Two reasons are spotted by the CI (there may be others): 1) For Windows, as in the following regression.diffs: COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +ERROR: character with byte sequence 0xe3 0x81 0x82 in encoding "UTF8" has no equivalent in encoding "WIN1252" 2) Second failure on Linux, with 32-bit builds: COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +ERROR: conversion between UTF8 and SQL_ASCII is not supported Likely, this should be made conditional, based on the fact that the database needs to be able to support utf8? There are a couple of examples like that in the tree, based on the following SQL trick: SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset \if :skip_test \quit \endif This requires an alternate output for the non-utf8 case. -- Michael signature.asc Description: PGP signature
Re: confusing / inefficient "need_transcoding" handling in copy
Hi Michael, ping. (Do you think that this patch is still needed?) Thanks, -- kou In <20240214.114608.2091541942684063981@clear-code.com> "Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, 14 Feb 2024 11:46:08 +0900 (JST), Sutou Kouhei wrote: > Hi, > > In > "Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, > 14 Feb 2024 06:56:16 +0900, > Michael Paquier wrote: > >> We have a couple of non-ASCII characters in the tests, but I suspect >> that this one will not be digested correctly everywhere, even if >> EUC_JP should be OK to use for the check. How about writing an >> arbitrary sequence of bytes into a temporary file that gets used for >> the COPY FROM instead? See for example how we do that with >> abs_builddir in copy.sql. > > It makes sense. How about the attached patch? > > > Thanks, > -- > kou
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, In "Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, 14 Feb 2024 06:56:16 +0900, Michael Paquier wrote: > We have a couple of non-ASCII characters in the tests, but I suspect > that this one will not be digested correctly everywhere, even if > EUC_JP should be OK to use for the check. How about writing an > arbitrary sequence of bytes into a temporary file that gets used for > the COPY FROM instead? See for example how we do that with > abs_builddir in copy.sql. It makes sense. How about the attached patch? Thanks, -- kou >From 6eb9669f97c54f8b85fac63db40ad80664692d12 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Wed, 14 Feb 2024 11:44:13 +0900 Subject: [PATCH v2] Add a test for invalid encoding for COPY FROM The test data use an UTF-8 character (U+3042 HIRAGANA LETTER A) but the test specifies EUC_JP. So it's an invalid data. --- src/test/regress/expected/copyencoding.out | 13 + src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 15 +++ 3 files changed, 29 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 00..32a9d918fa --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,13 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR +CREATE TABLE test (t text); +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 1d8a414eea..238cef28c4 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # -- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # -- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 00..89e2124996 --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,15 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- + +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR + +CREATE TABLE test (t text); + +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); + +DROP TABLE test; -- 2.43.0
Re: confusing / inefficient "need_transcoding" handling in copy
On Thu, Feb 08, 2024 at 05:25:01PM +0900, Sutou Kouhei wrote: > In <20240206222445.hzq22pb2nye7r...@awork3.anarazel.de> > "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 6 > Feb 2024 14:24:45 -0800, > Andres Freund wrote: > >> One unfortunate issue: We don't have any tests verifying that COPY FROM >> catches encoding issues. > > How about the attached patch for it? > > +CREATE TABLE test (t text); > +COPY test FROM stdin WITH (ENCODING 'EUC_JP'); > +こんにちは > +\. > + > +DROP TABLE test; We have a couple of non-ASCII characters in the tests, but I suspect that this one will not be digested correctly everywhere, even if EUC_JP should be OK to use for the check. How about writing an arbitrary sequence of bytes into a temporary file that gets used for the COPY FROM instead? See for example how we do that with abs_builddir in copy.sql. -- Michael signature.asc Description: PGP signature
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, On 2024-02-09 09:36:28 +0900, Michael Paquier wrote: > On Thu, Feb 08, 2024 at 10:25:07AM +0200, Heikki Linnakangas wrote: > > There's no validation, just conversion. I'd suggest: > > > > "Set up encoding conversion info if the file and server encodings differ > > (see also pg_server_to_any)." > > > > Other than that, +1 > > Cool. I've used your wording and applied that on HEAD. Thanks. LGTM. Greetings, Andres Freund
Re: confusing / inefficient "need_transcoding" handling in copy
On Thu, Feb 08, 2024 at 10:25:07AM +0200, Heikki Linnakangas wrote: > There's no validation, just conversion. I'd suggest: > > "Set up encoding conversion info if the file and server encodings differ > (see also pg_server_to_any)." > > Other than that, +1 Cool. I've used your wording and applied that on HEAD. > BTW, I can see an optimization opportunity even if the encodings differ: > Currently, CopyAttributeOutText() calls pg_server_to_any(), and then grovels > through the string to find any characters that need to be quoted. You could > do it the other way round and handle quoting before the conversion. That has > two benefits: > > 1. You don't need the strlen() call, because you just scanned through the > string so you already know its length. > 2. You don't need to worry about 'encoding_embeds_ascii' when you operate on > the server encoding. That sounds right, still it looks like there would be cases where you'd need the strlen() call if !encoding_embeds_ascii. -- Michael signature.asc Description: PGP signature
Re: confusing / inefficient "need_transcoding" handling in copy
On Thu, Feb 08, 2024 at 05:29:46PM +0900, Sutou Kouhei wrote: > Oh, sorry. I missed the Michael's patch: > https://www.postgresql.org/message-id/flat/ZcR9Q9hJ8GedFSCd%40paquier.xyz#e73272b042a22befac7a95f7bcb4fb9a > > I withdraw my change. No problem. Thanks for caring about that. -- Michael signature.asc Description: PGP signature
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, In <20240208.172501.2177371292839763981@clear-code.com> "Re: confusing / inefficient "need_transcoding" handling in copy" on Thu, 08 Feb 2024 17:25:01 +0900 (JST), Sutou Kouhei wrote: > How about the following to avoid needless transcoding? Oh, sorry. I missed the Michael's patch: https://www.postgresql.org/message-id/flat/ZcR9Q9hJ8GedFSCd%40paquier.xyz#e73272b042a22befac7a95f7bcb4fb9a I withdraw my change. Thanks, -- kou
Re: confusing / inefficient "need_transcoding" handling in copy
On 08/02/2024 09:05, Michael Paquier wrote: On Tue, Feb 06, 2024 at 02:24:45PM -0800, Andres Freund wrote: I think the code is just very confusing - there actually *is* verification of the encoding, it just happens at a different, earlier, layer, namely in copyfromparse.c: CopyConvertBuf() which says: /* * If the file and server encoding are the same, no encoding conversion is * required. However, we still need to verify that the input is valid for * the encoding. */ And does indeed verify that. This has been switched by Heikki in f82de5c46bdf, in 2021, that has removed pg_database_encoding_max_length() in the COPY FROM case. (Adding him now in CC, in case he has any comments). Yeah, I agree the "pg_database_encoding_max_length() > 1" check in COPY TO is unnecessary. It's always been like that, but now that the COPY TO and COPY FROM cases don't share the code, it's more obvious. Let's remove it. On your patch: +* Set up encoding conversion info, validating data if server and +* client encodings are not the same (see also pg_server_to_any). There's no validation, just conversion. I'd suggest: "Set up encoding conversion info if the file and server encodings differ (see also pg_server_to_any)." Other than that, +1 That's a performance-only change, but there may be a good argument for backpatching something, perhaps? -1 for backpatching, out of principle. This is not a regression, it's always been like that. Seems innocent, but why is this different from any other performance improvement we make. BTW, I can see an optimization opportunity even if the encodings differ: Currently, CopyAttributeOutText() calls pg_server_to_any(), and then grovels through the string to find any characters that need to be quoted. You could do it the other way round and handle quoting before the conversion. That has two benefits: 1. You don't need the strlen() call, because you just scanned through the string so you already know its length. 2. You don't need to worry about 'encoding_embeds_ascii' when you operate on the server encoding. -- Heikki Linnakangas Neon (https://neon.tech)
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, In <20240206222445.hzq22pb2nye7r...@awork3.anarazel.de> "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 6 Feb 2024 14:24:45 -0800, Andres Freund wrote: > One unfortunate issue: We don't have any tests verifying that COPY FROM > catches encoding issues. How about the attached patch for it? How about the following to avoid needless transcoding? diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index bd4710a79b..80ec26cafe 100644 --- a/src/backend/commands/copyto.c +++ b/src/backend/commands/copyto.c @@ -612,13 +612,14 @@ BeginCopyTo(ParseState *pstate, cstate->file_encoding = cstate->opts.file_encoding; /* -* Set up encoding conversion info. Even if the file and server encodings -* are the same, we must apply pg_any_to_server() to validate data in -* multibyte encodings. +* Set up encoding conversion info. We use pg_server_to_any() for the +* conversion. pg_server_to_any() does nothing with an encoding that +* equals to GetDatabaseEncoding() and PG_SQL_ASCII. If +* cstate->file_encoding equals to GetDatabaseEncoding() and PG_SQL_ASCII, +* we don't need to transcode. */ - cstate->need_transcoding = - (cstate->file_encoding != GetDatabaseEncoding() || -pg_database_encoding_max_length() > 1); + cstate->need_transcoding = !(cstate->file_encoding == GetDatabaseEncoding() || + cstate->file_encoding == PG_SQL_ASCII); /* See Multibyte encoding comment above */ cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->file_encoding); Numbers on my environment: master: 861.646ms patched: 697.547ms SQL: COPY (SELECT 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2, generate_series(1, 100::int4)) TO '/dev/null' \watch c=5 Thanks, -- kou From 387f11bde4eb928af23f6a66101aa9d63b3dcfdf Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Thu, 8 Feb 2024 17:17:25 +0900 Subject: [PATCH v1] Add a test for invalid encoding for COPY FROM The test data uses UTF-8 "hello" in Japanese but the test specifies EUC_JP. So it's an invalid data. --- src/test/regress/expected/copyencoding.out | 8 src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 10 ++ 3 files changed, 19 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 00..aa4ecea09e --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,8 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- +CREATE TABLE test (t text); +COPY test FROM stdin WITH (ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 1d8a414eea..238cef28c4 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # -- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # -- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 00..07c85e988b --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,10 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- + +CREATE TABLE test (t text); +COPY test FROM stdin WITH (ENCODING 'EUC_JP'); +こんにちは +\. + +DROP TABLE test; -- 2.43.0
Re: confusing / inefficient "need_transcoding" handling in copy
On Tue, Feb 06, 2024 at 02:24:45PM -0800, Andres Freund wrote: > I think the code is just very confusing - there actually *is* verification of > the encoding, it just happens at a different, earlier, layer, namely in > copyfromparse.c: CopyConvertBuf() which says: > /* >* If the file and server encoding are the same, no encoding conversion > is >* required. However, we still need to verify that the input is valid > for >* the encoding. >*/ > > And does indeed verify that. This has been switched by Heikki in f82de5c46bdf, in 2021, that has removed pg_database_encoding_max_length() in the COPY FROM case. (Adding him now in CC, in case he has any comments). > One unfortunate issue: We don't have any tests verifying that COPY FROM > catches encoding issues. Oops. Anyway, I was looking at the copyto.c code because I need to get something on this thread to be able to do something about the pluggable APIs in COPY, and echoing with what you mentioned upthread, what we only need to do is to set need_transcoding only when the client and the server encodings are not the same? Am I missing something? Runtime gets much better in average, around 1260ms on HEAD vs 1023ms with the attached for the example of upthread on a single process. Some profile data from CopyOneRowTo(), if relevant: * HEAD: - 82.78%10.96% postgres postgres[.] CopyOneRowTo - 71.82% CopyOneRowTo + 30.87% OutputFunctionCall - 13.21% CopyAttributeOutText pg_server_to_any - 9.48% appendBinaryStringInfo 4.93% enlargeStringInfo 3.33% 0xa4e1e234 + 3.20% CopySendEndOfRow 2.66% 0xa4e1e214 1.02% pgstat_progress_update_param 0.86% memcpy@plt 0.74% 0xa4e1cba4 0.72% MemoryContextReset 0.72% 0xa4e1cba8 0.59% enlargeStringInfo 0.55% 0xa4e1cb40 0.54% 0xa4e1cb74 0.52% 0xa4e1cb8c + 10.96% _start * patch: - 80.82%12.25% postgres postgres[.] CopyOneRowTo - 68.57% CopyOneRowTo + 36.55% OutputFunctionCall 11.44% CopyAttributeOutText + 8.87% appendBinaryStringInfo + 3.79% CopySendEndOfRow 1.01% pgstat_progress_update_param 0.79% int2out 0.66% MemoryContextReset 0.63% 0xaa624ba8 0.60% memcpy@plt 0.60% enlargeStringInfo 0.53% 0xaa624ba4 + 12.25% _start That's a performance-only change, but there may be a good argument for backpatching something, perhaps? -- Michael From 6ddbcd4d6333bc96c21ca95d632d83f1e1459064 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 8 Feb 2024 16:03:39 +0900 Subject: [PATCH] Speedup COPY TO --- src/backend/commands/copyto.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index bd4710a79b..c6b45cab53 100644 --- a/src/backend/commands/copyto.c +++ b/src/backend/commands/copyto.c @@ -612,13 +612,15 @@ BeginCopyTo(ParseState *pstate, cstate->file_encoding = cstate->opts.file_encoding; /* - * Set up encoding conversion info. Even if the file and server encodings - * are the same, we must apply pg_any_to_server() to validate data in - * multibyte encodings. + * Set up encoding conversion info, validating data if server and + * client encodings are not the same (see also pg_server_to_any). */ - cstate->need_transcoding = - (cstate->file_encoding != GetDatabaseEncoding() || - pg_database_encoding_max_length() > 1); + if (cstate->file_encoding == GetDatabaseEncoding() || + cstate->file_encoding == PG_SQL_ASCII) + cstate->need_transcoding = false; + else + cstate->need_transcoding = true; + /* See Multibyte encoding comment above */ cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->file_encoding); -- 2.43.0 signature.asc Description: PGP signature
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, On 2024-02-06 12:51:48 -0500, Tom Lane wrote: > Michael Paquier writes: > > On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote: > >> I haven't yet dug into the code history. One guess is that this should only > >> have been set this way for COPY FROM. > > > Looking the git history, this looks like an oversight of c61a2f58418e > > that has added the condition on pg_database_encoding_max_length(), no? > > Adding Tom and Ishii-san, even if this comes from 2005. > > Yeah, back in 8.1 that code was applied for both directions, but > probably it should have enforced validation for same-encoding > cases only for COPY FROM. > > It looks like now we have a mess, because the condition was copied > verbatim into copyto.c but not copyfrom.c. Aren't we failing to > validate encoding in this case in COPY FROM, which is where we > actually need to? I think the code is just very confusing - there actually *is* verification of the encoding, it just happens at a different, earlier, layer, namely in copyfromparse.c: CopyConvertBuf() which says: /* * If the file and server encoding are the same, no encoding conversion is * required. However, we still need to verify that the input is valid for * the encoding. */ And does indeed verify that. One unfortunate issue: We don't have any tests verifying that COPY FROM catches encoding issues. Regards, Andres
Re: confusing / inefficient "need_transcoding" handling in copy
Michael Paquier writes: > On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote: >> I haven't yet dug into the code history. One guess is that this should only >> have been set this way for COPY FROM. > Looking the git history, this looks like an oversight of c61a2f58418e > that has added the condition on pg_database_encoding_max_length(), no? > Adding Tom and Ishii-san, even if this comes from 2005. Yeah, back in 8.1 that code was applied for both directions, but probably it should have enforced validation for same-encoding cases only for COPY FROM. It looks like now we have a mess, because the condition was copied verbatim into copyto.c but not copyfrom.c. Aren't we failing to validate encoding in this case in COPY FROM, which is where we actually need to? regards, tom lane
Re: confusing / inefficient "need_transcoding" handling in copy
On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote: > I don't really understand why we need to validate anything during COPY TO? > Which is good, because it turns out that we don't actually validate anything, > as pg_server_to_any() returns without doing anything if the encoding matches: > > if (encoding == DatabaseEncoding->encoding || > encoding == PG_SQL_ASCII) > return unconstify(char *, s); /* assume data is valid */ > > This means that the strlen() we do in the call do pg_server_to_any(), which on > its own takes 14.25% of the cycles, computes something that will never be > used. Indeed, that's wasting cycles for nothing when the client and server encoding match. > Unsurprisingly, only doing transcoding when encodings differ yields a sizable > improvement, about 18% for [2]. > > I haven't yet dug into the code history. One guess is that this should only > have been set this way for COPY FROM. Looking the git history, this looks like an oversight of c61a2f58418e that has added the condition on pg_database_encoding_max_length(), no? Adding Tom and Ishii-san, even if this comes from 2005. -- Michael signature.asc Description: PGP signature