Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
On Thu, Jun 9, 2011 at 12:39 AM, Jeevan Chalke jeevan.cha...@enterprisedb.com wrote: It's a problem, but without an efficient algorithm for Unicode case folding, any fix we attempt to implement seems like it'll just be moving the problem around. Agree. I read on other mail thread that str_tolower() is a wide-character-aware lower function but it is also a collation-aware and hence might change its behaviour wrt change in locale. However, Tom suggested that we need to have non-locale-dependent case folding algorithm. But still for same locale on same machine, where we can able to create a table, insert some data, we cannot retrieve it. Don't you think it is more serious and we need a quick solution here? As said earlier it may even lead to pg_dump failures. Given that str_tolower() functionality is locale dependent but still it will resolve this particular issue. Not sure, there might be a performance issue but at-least we are not giving an error. Well, as I understand it, the problem here is that if someone goes and changes the locale, then you might massively break the user's application. For example, if the user says: CREATE TABLE FOO (...); SELECT * FROM FOO; ...that'll work, of course, because whatever you get when you downcase FOO will be the same both times. But if the locale now changes, then the next... SELECT * FROM FOO; ...might fail, because the new downcasing of FOO might not match the old one. You could argue that that's better than the current situation, but it's not clear-cut. But now that I re-think about it, I guess what I'm confused about is this code here: if (ch = 'A' ch = 'Z') ch += 'a' - 'A'; else if (IS_HIGHBIT_SET(ch) isupper(ch)) ch = tolower(ch); result[i] = (char) ch; It seems to me that we're downcasing the first byte of each wide character and ignoring the rest... which seems like it can't possibly be a good idea in a multi-byte encoding. Perhaps we could keep that approach for single-byte encodings and just pass through multi-byte characters untouched? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
Robert Haas robertmh...@gmail.com writes: But now that I re-think about it, I guess what I'm confused about is this code here: if (ch = 'A' ch = 'Z') ch += 'a' - 'A'; else if (IS_HIGHBIT_SET(ch) isupper(ch)) ch = tolower(ch); result[i] = (char) ch; The expected behavior there is that case-folding of non-ASCII characters will occur in single-byte encodings but nothing will happen to multi-byte characters. We are relying on isupper() to not return true when presented with a character fragment in a multibyte locale. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
On Thu, Jun 9, 2011 at 10:07 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: But now that I re-think about it, I guess what I'm confused about is this code here: if (ch = 'A' ch = 'Z') ch += 'a' - 'A'; else if (IS_HIGHBIT_SET(ch) isupper(ch)) ch = tolower(ch); result[i] = (char) ch; The expected behavior there is that case-folding of non-ASCII characters will occur in single-byte encodings but nothing will happen to multi-byte characters. We are relying on isupper() to not return true when presented with a character fragment in a multibyte locale. Based on Jeevan's original message, it seems like that's not always the case, at least on Windows. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
Robert Haas robertmh...@gmail.com writes: On Thu, Jun 9, 2011 at 10:07 AM, Tom Lane t...@sss.pgh.pa.us wrote: We are relying on isupper() to not return true when presented with a character fragment in a multibyte locale. Based on Jeevan's original message, it seems like that's not always the case, at least on Windows. Hmm. Now that you mention it, I think the same has been said about OSX. If we need to work around brain-dead isupper() tests, maybe the best thing is to implement two versions of the loop: if (encoding is single byte) ... loop as it stands ... else ... loop without the else if part regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Jun 9, 2011 at 10:07 AM, Tom Lane t...@sss.pgh.pa.us wrote: We are relying on isupper() to not return true when presented with a character fragment in a multibyte locale. Based on Jeevan's original message, it seems like that's not always the case, at least on Windows. Hmm. Now that you mention it, I think the same has been said about OSX. If we need to work around brain-dead isupper() tests, maybe the best thing is to implement two versions of the loop: if (encoding is single byte) ... loop as it stands ... else ... loop without the else if part That seems like a clear improvement. It's a long way from perfect, but still worthwhile. Would we back-patch that? Just do it in master? Wait for 9.2? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
Robert Haas robertmh...@gmail.com writes: On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane t...@sss.pgh.pa.us wrote: If we need to work around brain-dead isupper() tests, maybe the best thing is to implement two versions of the loop: if (encoding is single byte) ... loop as it stands ... else ... loop without the else if part That seems like a clear improvement. It's a long way from perfect, but still worthwhile. Would we back-patch that? Just do it in master? Wait for 9.2? It looks to me like a portability bug fix, so I'd say back-patch. I'll take care of it, unless you're hot to do so? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
On Thu, Jun 9, 2011 at 10:38 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane t...@sss.pgh.pa.us wrote: If we need to work around brain-dead isupper() tests, maybe the best thing is to implement two versions of the loop: if (encoding is single byte) ... loop as it stands ... else ... loop without the else if part That seems like a clear improvement. It's a long way from perfect, but still worthwhile. Would we back-patch that? Just do it in master? Wait for 9.2? It looks to me like a portability bug fix, so I'd say back-patch. I'll take care of it, unless you're hot to do so? Nope, have at it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
Robert Haas robertmh...@gmail.com writes: On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane t...@sss.pgh.pa.us wrote: If we need to work around brain-dead isupper() tests, maybe the best thing is to implement two versions of the loop: if (encoding is single byte) That seems like a clear improvement. It's a long way from perfect, but still worthwhile. Hmm ... while the above is easy enough to do in the backend, where we can look at pg_database_encoding_max_length, we have also got instances of this coding pattern in src/port/pgstrcasecmp.c. It's a lot less obvious how to make the test in frontend environments. Thoughts anyone? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
On Thu, Jun 9, 2011 at 11:17 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Jun 9, 2011 at 10:15 AM, Tom Lane t...@sss.pgh.pa.us wrote: If we need to work around brain-dead isupper() tests, maybe the best thing is to implement two versions of the loop: if (encoding is single byte) That seems like a clear improvement. It's a long way from perfect, but still worthwhile. Hmm ... while the above is easy enough to do in the backend, where we can look at pg_database_encoding_max_length, we have also got instances of this coding pattern in src/port/pgstrcasecmp.c. It's a lot less obvious how to make the test in frontend environments. Thoughts anyone? I'm not sure if this helps at all, but an awful lot of those tests are against hard-coded strings that are known to contain only ASCII characters. Is there some way we can optimize this for that case? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
Robert Haas robertmh...@gmail.com writes: On Thu, Jun 9, 2011 at 11:17 AM, Tom Lane t...@sss.pgh.pa.us wrote: Hmm ... while the above is easy enough to do in the backend, where we can look at pg_database_encoding_max_length, we have also got instances of this coding pattern in src/port/pgstrcasecmp.c. It's a lot less obvious how to make the test in frontend environments. Thoughts anyone? I'm not sure if this helps at all, but an awful lot of those tests are against hard-coded strings that are known to contain only ASCII characters. Is there some way we can optimize this for that case? For the places where we're just looking for a match to a fixed all-ASCII string, an ASCII-only downcasing would be sufficient, and would eliminate the whole problem. But I doubt all the callers fall into that class. What I'm particularly worried about at the moment is whether we are assuming anywhere that the frontend side can duplicate the backend's identifier downcasing behavior. That seems like a complete morass, because (1) they might not have the same locale, (2) they might not have the same encoding, (3) even if they do, the same locale is known to behave differently on different platforms. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
On Thu, Jun 9, 2011 at 1:22 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Jun 9, 2011 at 11:17 AM, Tom Lane t...@sss.pgh.pa.us wrote: Hmm ... while the above is easy enough to do in the backend, where we can look at pg_database_encoding_max_length, we have also got instances of this coding pattern in src/port/pgstrcasecmp.c. It's a lot less obvious how to make the test in frontend environments. Thoughts anyone? I'm not sure if this helps at all, but an awful lot of those tests are against hard-coded strings that are known to contain only ASCII characters. Is there some way we can optimize this for that case? For the places where we're just looking for a match to a fixed all-ASCII string, an ASCII-only downcasing would be sufficient, and would eliminate the whole problem. But I doubt all the callers fall into that class. What I'm particularly worried about at the moment is whether we are assuming anywhere that the frontend side can duplicate the backend's identifier downcasing behavior. That seems like a complete morass, because (1) they might not have the same locale, (2) they might not have the same encoding, (3) even if they do, the same locale is known to behave differently on different platforms. Right. Understood. So let's look at the cases (from git grep pg_strcasecmp and pg_strncasecmp): contrib/dict_int: Fixed strings only, and it's all backend code anyway. contrib/dict_xsyn: Fixed strings only, and it's all backend code anyway. contrib/hstore: Fixed strings only, and it's all backend code anyway. contrib/pg_upgrade: Used to compare LC_COLLATE, LC_CTYPE, and encoding names. contrib/pgbench: Definitely front-end code, but it's all fixed strings. contrib/pgcrypto: All fixed strings except for one instance in px_find_digit. But it's all backend contrib/spi: One instance, not a fixed string, but it's backend code. contrib/unaccent: One instance, not a fixed string, but it's backend code. src/backend/*: Backend code, obviously. src/bin/initdb: Strings from a constant lookup table (tsearch_config_languages) only. src/bin/pg_basebackup: Fixed strings only. src/bin/pg_ctl: Fixed strings only. src/bin/pg_dump: Fixed strings only. src/bin/psql: Fixed strings only. In a couple of cases they are not constants - help.c uses strings from to generated file sql_help.h, and tab-complete.c uses strings from a constant array called words_after_create[]. But these are constant lookup tables. src/include: access/reloptions.h uses strncasecmp() as part of a macro. That should be OK as long as no one tries to include this in frontend code, which seems rather impractical. src/interfaces/ecpg/ecpglib: Fixed strings. src/interfaces/ecpg/pgtypeslib: Fixed strings, and strings from a constant lookup table, only. src/interfaces/ecpg/preproc: This looks a bit worrisome. It seems we might be using it on identifiers here. src/interfaces/libpq: This is attempting to match a wildcard certificate name against a hostname, in two different places. src/port/chklocale.c: Fixed strings or ones from a lookup table. src/timezone/pgtz.c: Matches input strings against filenames read from the OS. So mostly I think these are OK. The instance in src/interfaces/ecpg/preproc looks like the most likely candidate for a problem spot. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
Robert Haas robertmh...@gmail.com writes: Right. Understood. So let's look at the cases (from git grep pg_strcasecmp and pg_strncasecmp): Also pg_toupper and pg_tolower. Right offhand, it looks like psql *does* assume it can lower-case identifiers this way :-( regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
On Thu, Jun 9, 2011 at 2:58 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Right. Understood. So let's look at the cases (from git grep pg_strcasecmp and pg_strncasecmp): Also pg_toupper and pg_tolower. Right offhand, it looks like psql *does* assume it can lower-case identifiers this way :-( Blarg. I dunno what to do about that - but surely we must have the encoding available somewhere? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
On Wed, Jun 8, 2011 at 6:22 AM, Robert Haas robertmh...@gmail.com wrote: 2011/6/7 Jeevan Chalke jeevan.cha...@enterprisedb.com: since we smash the identifier to lower case using downcase_truncate_identifier() function, the solution is to make this function should be wide-char aware, like LOWER() function functionality. I see some discussion related to downcase_truncate_identifier() and wide-char aware function, but seems like we lost somewhere. (http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php) This invalid byte sequence issue seems like a more serious issue, because it might lead e.g to pg_dump failures. It's a problem, but without an efficient algorithm for Unicode case folding, any fix we attempt to implement seems like it'll just be moving the problem around. Agree. I read on other mail thread that str_tolower() is a wide-character-aware lower function but it is also a collation-aware and hence might change its behaviour wrt change in locale. However, Tom suggested that we need to have non-locale-dependent case folding algorithm. But still for same locale on same machine, where we can able to create a table, insert some data, we cannot retrieve it. Don't you think it is more serious and we need a quick solution here? As said earlier it may even lead to pg_dump failures. Given that str_tolower() functionality is locale dependent but still it will resolve this particular issue. Not sure, there might be a performance issue but at-least we are not giving an error. Please excuse me, if community already had a lot of discussion and kept this behaviour intentionally knowing all these errors and serious issues. Thanks -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Jeevan B Chalke Senior Software Engineer, RD EnterpriseDB Corporation The Enterprise PostgreSQL Company Phone: +91 20 30589500 Website: www.enterprisedb.com EnterpriseDB Blog: http://blogs.enterprisedb.com/ Follow us on Twitter: http://www.twitter.com/enterprisedb This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.
[HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
Hi Tom, Issue is on Windows: If you see in attached failure.out file, (after running failure.sql) we are getting ERROR: invalid byte sequence for encoding UTF8: 0xe59aff error. Please note that byte sequence we got from database is e5 9a ff, where as actual byte sequence for the wide character '功' is e5 8a 9f. '功' == UNICODE Character e5 8a 9f == Original Byte Sequence for the given characters e5 9a ff == downcase_truncate_identifier() result, which is invalid UTF8 representation, stored in pg_catalog table While displaying on client, we receive this invalid byte sequence which throws an error. Note that UTF8 characters have predefined character ranges for each byte which is checked in pg_utf8_islegal() function. Here is the code snippet: == a = source[2]; if (a 0x80 || a 0xBF) return false; == Note that source[2] = ff, which does not fall into the valid range which results in illegal UTF8 character sequence. If you carefully see the original one i.e. 9f, which falls within the range. since we smash the identifier to lower case using downcase_truncate_identifier() function, the solution is to make this function should be wide-char aware, like LOWER() function functionality. I see some discussion related to downcase_truncate_identifier() and wide-char aware function, but seems like we lost somewhere. (http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php) This invalid byte sequence issue seems like a more serious issue, because it might lead e.g to pg_dump failures. I have tested this on PG9.0 beta4 (one click installers), BTW, we have observed same with earlier version as well. Attached is the .sql and its output (run on PG9.0 beta4). Any thoughts??? Thanks -- Jeevan B Chalke Senior Software Engineer, RD EnterpriseDB Corporation The Enterprise PostgreSQL Company Phone: +91 20 30589500 Website: www.enterprisedb.com EnterpriseDB Blog: http://blogs.enterprisedb.com/ Follow us on Twitter: http://www.twitter.com/enterprisedb This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message. SELECT version(); set client_encoding to EUC_CN; SELECT name,setting FROM pg_settings WHERE name like 'lc%' OR name like '%encoding'; create table 加入 ( 用户名 text, 新功能 varchar); insert into 加入 values('- 隐私政策 ',' 使用条款'); insert into 加入 values('计划政策',' 登录到'); select 新功能 from 加入; select * from 加入; drop table 加入 ; failure.out Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Invalid byte sequence for encoding UTF8, caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
2011/6/7 Jeevan Chalke jeevan.cha...@enterprisedb.com: since we smash the identifier to lower case using downcase_truncate_identifier() function, the solution is to make this function should be wide-char aware, like LOWER() function functionality. I see some discussion related to downcase_truncate_identifier() and wide-char aware function, but seems like we lost somewhere. (http://archives.postgresql.org/pgsql-hackers/2010-11/msg01385.php) This invalid byte sequence issue seems like a more serious issue, because it might lead e.g to pg_dump failures. It's a problem, but without an efficient algorithm for Unicode case folding, any fix we attempt to implement seems like it'll just be moving the problem around. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers