Re: Almost bug in COPY FROM processing of GB18030 encoded input

2019-01-25 Thread Heikki Linnakangas

On 24/01/2019 23:27, Robert Haas wrote:

On Wed, Jan 23, 2019 at 6:23 AM Heikki Linnakangas  wrote:

I happened to notice that when CopyReadLineText() calls mblen(), it
passes only the first byte of the multi-byte characters. However,
pg_gb18030_mblen() looks at the first and the second byte.
CopyReadLineText() always passes \0 as the second byte, so
pg_gb18030_mblen() will incorrectly report the length of 4-byte encoded
characters as 2.

It works out fine, though, because the second half of the 4-byte encoded
character always looks like another 2-byte encoded character, in
GB18030. CopyReadLineText() is looking for delimiter and escape
characters and newlines, and only single-byte characters are supported
for those, so treating a 4-byte character as two 2-byte characters is
harmless.


Yikes.


Committed the comment changes, so it's less of a gotcha now.

- Heikki



Re: Almost bug in COPY FROM processing of GB18030 encoded input

2019-01-24 Thread Robert Haas
On Wed, Jan 23, 2019 at 6:23 AM Heikki Linnakangas  wrote:
> I happened to notice that when CopyReadLineText() calls mblen(), it
> passes only the first byte of the multi-byte characters. However,
> pg_gb18030_mblen() looks at the first and the second byte.
> CopyReadLineText() always passes \0 as the second byte, so
> pg_gb18030_mblen() will incorrectly report the length of 4-byte encoded
> characters as 2.
>
> It works out fine, though, because the second half of the 4-byte encoded
> character always looks like another 2-byte encoded character, in
> GB18030. CopyReadLineText() is looking for delimiter and escape
> characters and newlines, and only single-byte characters are supported
> for those, so treating a 4-byte character as two 2-byte characters is
> harmless.

Yikes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Almost bug in COPY FROM processing of GB18030 encoded input

2019-01-23 Thread Heikki Linnakangas

Hi,

I happened to notice that when CopyReadLineText() calls mblen(), it 
passes only the first byte of the multi-byte characters. However, 
pg_gb18030_mblen() looks at the first and the second byte. 
CopyReadLineText() always passes \0 as the second byte, so 
pg_gb18030_mblen() will incorrectly report the length of 4-byte encoded 
characters as 2.


It works out fine, though, because the second half of the 4-byte encoded 
character always looks like another 2-byte encoded character, in 
GB18030. CopyReadLineText() is looking for delimiter and escape 
characters and newlines, and only single-byte characters are supported 
for those, so treating a 4-byte character as two 2-byte characters is 
harmless.


Attached is a patch to explain that in the comments. Grepping for 
mblen(), I didn't find any other callers that used mblen() like that.


- Heikki
>From b0a09c240a2b4d1120433828ca052a2542ff362d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 23 Jan 2019 13:04:05 +0200
Subject: [PATCH 1/1] Fix comments to that claimed that mblen() only looks at
 first byte.

GB18030's mblen() function looks at the first and the second byte of the
multibyte character, to determine its length. copy.c had made the assumption
that mblen() only looks at the first byte, but it turns out to work out
fine, because of the way the GB18030 encoding works. COPY will see a 4-byte
encoded character as two 2-byte encoded characters, which is enough for
COPY's purposes. It cannot mix those up with delimiter or escaping
characters, because only single-byte characters are supported as
delimiters or escape characters.
---
 src/backend/commands/copy.c  |  7 ++-
 src/backend/utils/mb/wchar.c | 32 +---
 2 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index a61a628471..5074ce0daa 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -4073,9 +4073,14 @@ not_end_of_copy:
 		{
 			int			mblen;
 
+			/*
+			 * It is enough to look at the first byte in all our encodings, to
+			 * get the length.  (GB18030 is a bit special, but still works for
+			 * our purposes; see comment in pg_gb18030_mblen())
+			 */
 			mblen_str[0] = c;
-			/* All our encodings only read the first byte to get the length */
 			mblen = pg_encoding_mblen(cstate->file_encoding, mblen_str);
+
 			IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(mblen - 1);
 			IF_NEED_REFILL_AND_EOF_BREAK(mblen - 1);
 			raw_buf_ptr += mblen - 1;
diff --git a/src/backend/utils/mb/wchar.c b/src/backend/utils/mb/wchar.c
index a5fdda456e..8e5116dfc1 100644
--- a/src/backend/utils/mb/wchar.c
+++ b/src/backend/utils/mb/wchar.c
@@ -15,16 +15,23 @@
 
 
 /*
- * conversion to pg_wchar is done by "table driven."
- * to add an encoding support, define mb2wchar_with_len(), mblen(), dsplen()
- * for the particular encoding. Note that if the encoding is only
- * supported in the client, you don't need to define
- * mb2wchar_with_len() function (SJIS is the case).
+ * Operations on multi-byte encodings are driven by a table of helper
+ * functions.
+ *
+ * To add an encoding support, define mblen(), dsplen() and verifier() for
+ * the encoding.  For server-encodings, also define mb2wchar() and wchar2mb()
+ * conversion functions.
  *
  * These functions generally assume that their input is validly formed.
  * The "verifier" functions, further down in the file, have to be more
- * paranoid.  We expect that mblen() does not need to examine more than
- * the first byte of the character to discover the correct length.
+ * paranoid.
+ *
+ * We expect that mblen() does not need to examine more than the first byte
+ * of the character to discover the correct length.  GB18030 is an exception
+ * to that rule, though, as it also looks at second byte.  But even that
+ * behaves in a predictable way, if you only pass the first byte: it will
+ * treat 4-byte encoded characters as two 2-byte encoded characters, which is
+ * good enough for all current uses.
  *
  * Note: for the display output of psql to work properly, the return values
  * of the dsplen functions must conform to the Unicode standard. In particular
@@ -1073,6 +1080,17 @@ pg_uhc_dsplen(const unsigned char *s)
  * GB18030
  *	Added by Bill Huang ,
  */
+
+/*
+ * Unlike all other mblen() functions, this also looks at the second byte of
+ * the input.  However, if you only pass the first byte of a multi-byte
+ * string, and \0 as the second byte, this still works in a predictable way:
+ * a 4-byte character will be reported as two 2-byte characters.  That's
+ * enough for all current uses, as a client-only encoding.  It works that
+ * way, because in any valid 4-byte GB18030-encoded character, the third and
+ * fourth byte look like a 2-byte encoded character, when looked at
+ * separately.
+ */
 static int
 pg_gb18030_mblen(const unsigned char *s)
 {
-- 
2.20.1