Re: confusing / inefficient "need_transcoding" handling in copy
Hi, In "Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, 14 Feb 2024 06:56:16 +0900, Michael Paquier wrote: > We have a couple of non-ASCII characters in the tests, but I suspect > that this one will not be digested correctly everywhere, even if > EUC_JP should be OK to use for the check. How about writing an > arbitrary sequence of bytes into a temporary file that gets used for > the COPY FROM instead? See for example how we do that with > abs_builddir in copy.sql. It makes sense. How about the attached patch? Thanks, -- kou >From 6eb9669f97c54f8b85fac63db40ad80664692d12 Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Wed, 14 Feb 2024 11:44:13 +0900 Subject: [PATCH v2] Add a test for invalid encoding for COPY FROM The test data use an UTF-8 character (U+3042 HIRAGANA LETTER A) but the test specifies EUC_JP. So it's an invalid data. --- src/test/regress/expected/copyencoding.out | 13 + src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 15 +++ 3 files changed, 29 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 00..32a9d918fa --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,13 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR +CREATE TABLE test (t text); +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 1d8a414eea..238cef28c4 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # -- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # -- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 00..89e2124996 --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,15 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- + +-- directory paths are passed to us in environment variables +\getenv abs_builddir PG_ABS_BUILDDIR + +CREATE TABLE test (t text); + +\set utf8_csv :abs_builddir '/results/copyencoding_utf8.csv' +-- U+3042 HIRAGANA LETTER A +COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8'); +COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP'); + +DROP TABLE test; -- 2.43.0
Re: confusing / inefficient "need_transcoding" handling in copy
On Thu, Feb 08, 2024 at 05:25:01PM +0900, Sutou Kouhei wrote: > In <20240206222445.hzq22pb2nye7r...@awork3.anarazel.de> > "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 6 > Feb 2024 14:24:45 -0800, > Andres Freund wrote: > >> One unfortunate issue: We don't have any tests verifying that COPY FROM >> catches encoding issues. > > How about the attached patch for it? > > +CREATE TABLE test (t text); > +COPY test FROM stdin WITH (ENCODING 'EUC_JP'); > +こんにちは > +\. > + > +DROP TABLE test; We have a couple of non-ASCII characters in the tests, but I suspect that this one will not be digested correctly everywhere, even if EUC_JP should be OK to use for the check. How about writing an arbitrary sequence of bytes into a temporary file that gets used for the COPY FROM instead? See for example how we do that with abs_builddir in copy.sql. -- Michael signature.asc Description: PGP signature
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, On 2024-02-09 09:36:28 +0900, Michael Paquier wrote: > On Thu, Feb 08, 2024 at 10:25:07AM +0200, Heikki Linnakangas wrote: > > There's no validation, just conversion. I'd suggest: > > > > "Set up encoding conversion info if the file and server encodings differ > > (see also pg_server_to_any)." > > > > Other than that, +1 > > Cool. I've used your wording and applied that on HEAD. Thanks. LGTM. Greetings, Andres Freund
Re: confusing / inefficient "need_transcoding" handling in copy
On Thu, Feb 08, 2024 at 10:25:07AM +0200, Heikki Linnakangas wrote: > There's no validation, just conversion. I'd suggest: > > "Set up encoding conversion info if the file and server encodings differ > (see also pg_server_to_any)." > > Other than that, +1 Cool. I've used your wording and applied that on HEAD. > BTW, I can see an optimization opportunity even if the encodings differ: > Currently, CopyAttributeOutText() calls pg_server_to_any(), and then grovels > through the string to find any characters that need to be quoted. You could > do it the other way round and handle quoting before the conversion. That has > two benefits: > > 1. You don't need the strlen() call, because you just scanned through the > string so you already know its length. > 2. You don't need to worry about 'encoding_embeds_ascii' when you operate on > the server encoding. That sounds right, still it looks like there would be cases where you'd need the strlen() call if !encoding_embeds_ascii. -- Michael signature.asc Description: PGP signature
Re: confusing / inefficient "need_transcoding" handling in copy
On Thu, Feb 08, 2024 at 05:29:46PM +0900, Sutou Kouhei wrote: > Oh, sorry. I missed the Michael's patch: > https://www.postgresql.org/message-id/flat/ZcR9Q9hJ8GedFSCd%40paquier.xyz#e73272b042a22befac7a95f7bcb4fb9a > > I withdraw my change. No problem. Thanks for caring about that. -- Michael signature.asc Description: PGP signature
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, In <20240208.172501.2177371292839763981@clear-code.com> "Re: confusing / inefficient "need_transcoding" handling in copy" on Thu, 08 Feb 2024 17:25:01 +0900 (JST), Sutou Kouhei wrote: > How about the following to avoid needless transcoding? Oh, sorry. I missed the Michael's patch: https://www.postgresql.org/message-id/flat/ZcR9Q9hJ8GedFSCd%40paquier.xyz#e73272b042a22befac7a95f7bcb4fb9a I withdraw my change. Thanks, -- kou
Re: confusing / inefficient "need_transcoding" handling in copy
On 08/02/2024 09:05, Michael Paquier wrote: On Tue, Feb 06, 2024 at 02:24:45PM -0800, Andres Freund wrote: I think the code is just very confusing - there actually *is* verification of the encoding, it just happens at a different, earlier, layer, namely in copyfromparse.c: CopyConvertBuf() which says: /* * If the file and server encoding are the same, no encoding conversion is * required. However, we still need to verify that the input is valid for * the encoding. */ And does indeed verify that. This has been switched by Heikki in f82de5c46bdf, in 2021, that has removed pg_database_encoding_max_length() in the COPY FROM case. (Adding him now in CC, in case he has any comments). Yeah, I agree the "pg_database_encoding_max_length() > 1" check in COPY TO is unnecessary. It's always been like that, but now that the COPY TO and COPY FROM cases don't share the code, it's more obvious. Let's remove it. On your patch: +* Set up encoding conversion info, validating data if server and +* client encodings are not the same (see also pg_server_to_any). There's no validation, just conversion. I'd suggest: "Set up encoding conversion info if the file and server encodings differ (see also pg_server_to_any)." Other than that, +1 That's a performance-only change, but there may be a good argument for backpatching something, perhaps? -1 for backpatching, out of principle. This is not a regression, it's always been like that. Seems innocent, but why is this different from any other performance improvement we make. BTW, I can see an optimization opportunity even if the encodings differ: Currently, CopyAttributeOutText() calls pg_server_to_any(), and then grovels through the string to find any characters that need to be quoted. You could do it the other way round and handle quoting before the conversion. That has two benefits: 1. You don't need the strlen() call, because you just scanned through the string so you already know its length. 2. You don't need to worry about 'encoding_embeds_ascii' when you operate on the server encoding. -- Heikki Linnakangas Neon (https://neon.tech)
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, In <20240206222445.hzq22pb2nye7r...@awork3.anarazel.de> "Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 6 Feb 2024 14:24:45 -0800, Andres Freund wrote: > One unfortunate issue: We don't have any tests verifying that COPY FROM > catches encoding issues. How about the attached patch for it? How about the following to avoid needless transcoding? diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index bd4710a79b..80ec26cafe 100644 --- a/src/backend/commands/copyto.c +++ b/src/backend/commands/copyto.c @@ -612,13 +612,14 @@ BeginCopyTo(ParseState *pstate, cstate->file_encoding = cstate->opts.file_encoding; /* -* Set up encoding conversion info. Even if the file and server encodings -* are the same, we must apply pg_any_to_server() to validate data in -* multibyte encodings. +* Set up encoding conversion info. We use pg_server_to_any() for the +* conversion. pg_server_to_any() does nothing with an encoding that +* equals to GetDatabaseEncoding() and PG_SQL_ASCII. If +* cstate->file_encoding equals to GetDatabaseEncoding() and PG_SQL_ASCII, +* we don't need to transcode. */ - cstate->need_transcoding = - (cstate->file_encoding != GetDatabaseEncoding() || -pg_database_encoding_max_length() > 1); + cstate->need_transcoding = !(cstate->file_encoding == GetDatabaseEncoding() || + cstate->file_encoding == PG_SQL_ASCII); /* See Multibyte encoding comment above */ cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->file_encoding); Numbers on my environment: master: 861.646ms patched: 697.547ms SQL: COPY (SELECT 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2, generate_series(1, 100::int4)) TO '/dev/null' \watch c=5 Thanks, -- kou From 387f11bde4eb928af23f6a66101aa9d63b3dcfdf Mon Sep 17 00:00:00 2001 From: Sutou Kouhei Date: Thu, 8 Feb 2024 17:17:25 +0900 Subject: [PATCH v1] Add a test for invalid encoding for COPY FROM The test data uses UTF-8 "hello" in Japanese but the test specifies EUC_JP. So it's an invalid data. --- src/test/regress/expected/copyencoding.out | 8 src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/copyencoding.sql | 10 ++ 3 files changed, 19 insertions(+), 1 deletion(-) create mode 100644 src/test/regress/expected/copyencoding.out create mode 100644 src/test/regress/sql/copyencoding.sql diff --git a/src/test/regress/expected/copyencoding.out b/src/test/regress/expected/copyencoding.out new file mode 100644 index 00..aa4ecea09e --- /dev/null +++ b/src/test/regress/expected/copyencoding.out @@ -0,0 +1,8 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- +CREATE TABLE test (t text); +COPY test FROM stdin WITH (ENCODING 'EUC_JP'); +ERROR: invalid byte sequence for encoding "EUC_JP": 0xe3 0x81 +CONTEXT: COPY test, line 1 +DROP TABLE test; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 1d8a414eea..238cef28c4 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -36,7 +36,7 @@ test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comment # execute two copy tests in parallel, to check that copy itself # is concurrent safe. # -- -test: copy copyselect copydml insert insert_conflict +test: copy copyselect copydml copyencoding insert insert_conflict # -- # More groups of parallel tests diff --git a/src/test/regress/sql/copyencoding.sql b/src/test/regress/sql/copyencoding.sql new file mode 100644 index 00..07c85e988b --- /dev/null +++ b/src/test/regress/sql/copyencoding.sql @@ -0,0 +1,10 @@ +-- +-- Test cases for COPY WITH (ENCODING) +-- + +CREATE TABLE test (t text); +COPY test FROM stdin WITH (ENCODING 'EUC_JP'); +こんにちは +\. + +DROP TABLE test; -- 2.43.0
Re: confusing / inefficient "need_transcoding" handling in copy
On Tue, Feb 06, 2024 at 02:24:45PM -0800, Andres Freund wrote: > I think the code is just very confusing - there actually *is* verification of > the encoding, it just happens at a different, earlier, layer, namely in > copyfromparse.c: CopyConvertBuf() which says: > /* >* If the file and server encoding are the same, no encoding conversion > is >* required. However, we still need to verify that the input is valid > for >* the encoding. >*/ > > And does indeed verify that. This has been switched by Heikki in f82de5c46bdf, in 2021, that has removed pg_database_encoding_max_length() in the COPY FROM case. (Adding him now in CC, in case he has any comments). > One unfortunate issue: We don't have any tests verifying that COPY FROM > catches encoding issues. Oops. Anyway, I was looking at the copyto.c code because I need to get something on this thread to be able to do something about the pluggable APIs in COPY, and echoing with what you mentioned upthread, what we only need to do is to set need_transcoding only when the client and the server encodings are not the same? Am I missing something? Runtime gets much better in average, around 1260ms on HEAD vs 1023ms with the attached for the example of upthread on a single process. Some profile data from CopyOneRowTo(), if relevant: * HEAD: - 82.78%10.96% postgres postgres[.] CopyOneRowTo - 71.82% CopyOneRowTo + 30.87% OutputFunctionCall - 13.21% CopyAttributeOutText pg_server_to_any - 9.48% appendBinaryStringInfo 4.93% enlargeStringInfo 3.33% 0xa4e1e234 + 3.20% CopySendEndOfRow 2.66% 0xa4e1e214 1.02% pgstat_progress_update_param 0.86% memcpy@plt 0.74% 0xa4e1cba4 0.72% MemoryContextReset 0.72% 0xa4e1cba8 0.59% enlargeStringInfo 0.55% 0xa4e1cb40 0.54% 0xa4e1cb74 0.52% 0xa4e1cb8c + 10.96% _start * patch: - 80.82%12.25% postgres postgres[.] CopyOneRowTo - 68.57% CopyOneRowTo + 36.55% OutputFunctionCall 11.44% CopyAttributeOutText + 8.87% appendBinaryStringInfo + 3.79% CopySendEndOfRow 1.01% pgstat_progress_update_param 0.79% int2out 0.66% MemoryContextReset 0.63% 0xaa624ba8 0.60% memcpy@plt 0.60% enlargeStringInfo 0.53% 0xaa624ba4 + 12.25% _start That's a performance-only change, but there may be a good argument for backpatching something, perhaps? -- Michael From 6ddbcd4d6333bc96c21ca95d632d83f1e1459064 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Thu, 8 Feb 2024 16:03:39 +0900 Subject: [PATCH] Speedup COPY TO --- src/backend/commands/copyto.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index bd4710a79b..c6b45cab53 100644 --- a/src/backend/commands/copyto.c +++ b/src/backend/commands/copyto.c @@ -612,13 +612,15 @@ BeginCopyTo(ParseState *pstate, cstate->file_encoding = cstate->opts.file_encoding; /* - * Set up encoding conversion info. Even if the file and server encodings - * are the same, we must apply pg_any_to_server() to validate data in - * multibyte encodings. + * Set up encoding conversion info, validating data if server and + * client encodings are not the same (see also pg_server_to_any). */ - cstate->need_transcoding = - (cstate->file_encoding != GetDatabaseEncoding() || - pg_database_encoding_max_length() > 1); + if (cstate->file_encoding == GetDatabaseEncoding() || + cstate->file_encoding == PG_SQL_ASCII) + cstate->need_transcoding = false; + else + cstate->need_transcoding = true; + /* See Multibyte encoding comment above */ cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->file_encoding); -- 2.43.0 signature.asc Description: PGP signature
Re: confusing / inefficient "need_transcoding" handling in copy
Hi, On 2024-02-06 12:51:48 -0500, Tom Lane wrote: > Michael Paquier writes: > > On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote: > >> I haven't yet dug into the code history. One guess is that this should only > >> have been set this way for COPY FROM. > > > Looking the git history, this looks like an oversight of c61a2f58418e > > that has added the condition on pg_database_encoding_max_length(), no? > > Adding Tom and Ishii-san, even if this comes from 2005. > > Yeah, back in 8.1 that code was applied for both directions, but > probably it should have enforced validation for same-encoding > cases only for COPY FROM. > > It looks like now we have a mess, because the condition was copied > verbatim into copyto.c but not copyfrom.c. Aren't we failing to > validate encoding in this case in COPY FROM, which is where we > actually need to? I think the code is just very confusing - there actually *is* verification of the encoding, it just happens at a different, earlier, layer, namely in copyfromparse.c: CopyConvertBuf() which says: /* * If the file and server encoding are the same, no encoding conversion is * required. However, we still need to verify that the input is valid for * the encoding. */ And does indeed verify that. One unfortunate issue: We don't have any tests verifying that COPY FROM catches encoding issues. Regards, Andres
Re: confusing / inefficient "need_transcoding" handling in copy
Michael Paquier writes: > On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote: >> I haven't yet dug into the code history. One guess is that this should only >> have been set this way for COPY FROM. > Looking the git history, this looks like an oversight of c61a2f58418e > that has added the condition on pg_database_encoding_max_length(), no? > Adding Tom and Ishii-san, even if this comes from 2005. Yeah, back in 8.1 that code was applied for both directions, but probably it should have enforced validation for same-encoding cases only for COPY FROM. It looks like now we have a mess, because the condition was copied verbatim into copyto.c but not copyfrom.c. Aren't we failing to validate encoding in this case in COPY FROM, which is where we actually need to? regards, tom lane
Re: confusing / inefficient "need_transcoding" handling in copy
On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote: > I don't really understand why we need to validate anything during COPY TO? > Which is good, because it turns out that we don't actually validate anything, > as pg_server_to_any() returns without doing anything if the encoding matches: > > if (encoding == DatabaseEncoding->encoding || > encoding == PG_SQL_ASCII) > return unconstify(char *, s); /* assume data is valid */ > > This means that the strlen() we do in the call do pg_server_to_any(), which on > its own takes 14.25% of the cycles, computes something that will never be > used. Indeed, that's wasting cycles for nothing when the client and server encoding match. > Unsurprisingly, only doing transcoding when encodings differ yields a sizable > improvement, about 18% for [2]. > > I haven't yet dug into the code history. One guess is that this should only > have been set this way for COPY FROM. Looking the git history, this looks like an oversight of c61a2f58418e that has added the condition on pg_database_encoding_max_length(), no? Adding Tom and Ishii-san, even if this comes from 2005. -- Michael signature.asc Description: PGP signature
confusing / inefficient "need_transcoding" handling in copy
Hi, Looking at the profiles in [1], and similar profiles locally, made me wonder why a basic COPY TO shows pg_server_to_any() and the strlen() to compute the length of the to-be-converted string so heavily in profiles. Example profile, for [2]: - 88.11%12.02% postgres postgres [.] CopyOneRowTo - 76.09% CopyOneRowTo - 37.24% CopyAttributeOutText + 14.25% __strlen_evex + 2.76% pg_server_to_any + 0.03% 0x82a00c86 + 31.82% OutputFunctionCall + 2.98% CopySendEndOfRow + 2.75% appendBinaryStringInfo + 0.58% MemoryContextReset + 0.02% 0x82a00c86 + 12.01% standard_ExecutorRun + 0.02% PostgresMain In the basic cases the client and server encoding should be the same after all, so why do we need to do any conversion? The code has a comment about this: /* * Set up encoding conversion info. Even if the file and server encodings * are the same, we must apply pg_any_to_server() to validate data in * multibyte encodings. */ cstate->need_transcoding = (cstate->file_encoding != GetDatabaseEncoding() || pg_database_encoding_max_length() > 1); I don't really understand why we need to validate anything during COPY TO? Which is good, because it turns out that we don't actually validate anything, as pg_server_to_any() returns without doing anything if the encoding matches: if (encoding == DatabaseEncoding->encoding || encoding == PG_SQL_ASCII) return unconstify(char *, s); /* assume data is valid */ This means that the strlen() we do in the call do pg_server_to_any(), which on its own takes 14.25% of the cycles, computes something that will never be used. Unsurprisingly, only doing transcoding when encodings differ yields a sizable improvement, about 18% for [2]. I haven't yet dug into the code history. One guess is that this should only have been set this way for COPY FROM. Greetings, Andres Freund [1] https://www.postgresql.org/message-id/ZcGE8LrjGW8pmtOf%40paquier.xyz [2] COPY (SELECT 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2, generate_series(1, 100::int4)) TO '/dev/null';