Re: [HACKERS] Bug in UTF8-Validation Code?
What is the state of play with this item? I think this is a must-fix bug for 8.3. There was a flurry of messages back in April but since then I don't recall seeing anything.

cheers

andrew

Mark Dilger wrote:
> Mark Dilger wrote:
>> Bruce Momjian wrote:
>>> Added to TODO:
>>> * Fix cases where invalid byte encodings are accepted by the database, but throw an error on SELECT
>>>   http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php
>>
>> Is anyone working on fixing this bug?
>
> Hi, has anyone volunteered to fix this bug? I did not see any reply on the mailing list to your question above.
>
> mark
>
> OK, I can take a stab at fixing this. I'd like to state some assumptions so people can comment and reply:
>
> I assume that I need to fix *all* cases where invalid byte encodings get into the database through functions shipped in the core distribution. I assume I do not need to worry about people getting bad data into the system through their own database extensions.
>
> I assume that the COPY problem discussed up-thread goes away once you eliminate all the paths by which bad data can get into the system. However, existing database installations with bad data already loaded will not be magically fixed with these code patches.
>
> Do any of the string functions (see http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the risk of generating invalid utf8 encoded strings? Do I need to add checks? Are there known bugs with these functions in this regard? If not, I assume I can add mbverify calls to the various input routines (textin, varcharin, etc.) where invalid utf8 could otherwise enter the system.
>
> I assume that this work can be limited to HEAD and that I don't need to back-patch it. (I suspect this assumption is a contentious one.)
>
> Advice and comments are welcome,

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
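For readers following along: the check that COPY performs, and that the proposed mbverify calls in textin/varcharin would perform, amounts to UTF-8 validity testing. A minimal sketch of such a test, in Python purely for illustration (the backend implementation is C):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True iff `data` is a well-formed UTF-8 byte sequence.
    Sketch only: delegates to Python's strict UTF-8 decoder, which
    rejects bad lead bytes, truncated sequences, and overlong forms."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert is_valid_utf8("€".encode("utf-8"))      # well-formed 3-byte sequence
assert not is_valid_utf8(bytes([0x80]))         # bare continuation byte
assert not is_valid_utf8(b"\xc3")               # truncated 2-byte sequence
```

Any datum that fails a check like this on input is exactly the kind of value that later makes SELECT or COPY OUT/IN fail.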
Re: [HACKERS] Bug in UTF8-Validation Code?
Martijn van Oosterhout wrote:
> So your implementation is simply:
> 1. Take number and make UTF-8 string
> 2. Convert it to database encoding.

Aah, now I can spot where the misunderstanding is. That's not what I mean. I mean that chr() should simply 'typecast' to "char". So when the database encoding is UTF8, I want chr(14844588) to return a Euro sign, and when the encoding is LATIN9, then chr(14844588) should either yield the 'not' sign (UNICODE 0xAC) or an error message, depending on whether we want chr() to operate mod 256 like Oracle has it for single byte character sets or not, while chr(164) should return the Euro sign for LATIN9 database encoding.

Yours,
Laurenz Albe
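For illustration, this 'typecast' reading of chr() can be sketched outside the backend (Python, not PostgreSQL code): the integer argument is interpreted as the raw bytes of the character in the database encoding.

```python
# Laurenz's semantics: chr(14844588) in a UTF8 database yields the Euro sign,
# because 14844588 == 0xE282AC, the UTF-8 byte sequence for U+20AC.
n = 14844588
raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
assert raw == b"\xe2\x82\xac"
assert raw.decode("utf-8") == "€"                 # UTF8 database: Euro sign

# LATIN9 (ISO 8859-15): chr(164) is the Euro sign there...
assert bytes([164]).decode("iso-8859-15") == "€"
# ...and mod-256 Oracle behavior would turn 14844588 into byte 0xAC, the 'not' sign.
assert bytes([n % 256]).decode("iso-8859-15") == "¬"
```

So the same integer names different characters depending on the database encoding, which is exactly the "typecast" behavior being described.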
Re: [HACKERS] Bug in UTF8-Validation Code?
Martijn van Oosterhout writes:
> I think the problem is that most encodings do not have the concept of a code point anyway, so implementing it for them is fairly useless.

Yeah. I'm beginning to think that the right thing to do is

(a) make chr/ascii do the same thing as Oracle (ie, as in the discussed patch)

(b) make a different pair of functions that translate Unicode code points to/from the current database encoding.

regards, tom lane
Re: [HACKERS] Bug in UTF8-Validation Code?
On Thu, Apr 05, 2007 at 11:52:14AM +0200, Albe Laurenz wrote:
> But isn't a simple fix for chr() and ascii(), which does not require a redesign, a Good Thing for 8.3 if possible? Something that maintains as much upward and/or Oracle compatibility as possible while doing away with ascii('€') returning 226 in UTF-8?

I think the earlier expressed idea of getting chr/ascii to bail on non-ASCII characters isn't a bad one. I think your idea is bad in the sense that I actually need to know the encoding of the character I want to be able to use it. If I knew the encoding already, I could just use encode(). What I was thinking of was something that, irrespective of the encoding, gave me a string properly encoded with the character I want. AFAIK Unicode is the only character set that actually numbers the characters in a way not related to the encoding, so it would seem useful to be able to give a Unicode character number and get a string with that character...

So your implementation is simply:
1. Take number and make UTF-8 string
2. Convert it to database encoding.

> And I think - correct me if I am wrong - that conversion between character and integer representation of the character in the current database encoding is exactly that.

AFAIK there is no "integer representation" of a character in anything other than Unicode. Unicode is the only case that cannot be handled by a simple encode/decode.

Have a nice day,
--
Martijn van Oosterhout http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Re: [HACKERS] Bug in UTF8-Validation Code?
On Thu, Apr 05, 2007 at 09:34:25AM +0900, Tatsuo Ishii wrote:
> I'm not sure what kind of use case for unicode_char() you are thinking about. Anyway if you want a "code point" from a character, we could easily add such functions to all backend encodings we currently support. Probably it would look like:

I think the problem is that most encodings do not have the concept of a code point anyway, so implementing it for them is fairly useless.

> Example outputs are:
>
>    ASCII - 41
>    ISO 10646 - U+0041
>    ISO 10646 - U+29E3D
>    ISO 8859-1 - a5
>    JIS X 0208 - 4141

In every case other than Unicode you're doing the same thing as encode/decode. Since we already have those functions, there's no need to get chr/ascii to duplicate it. In the case of UTF-8 however, it does something that is not done by encode/decode, hence the proposal to simply extend chr/ascii to do that.

Have a nice day,
--
Martijn van Oosterhout http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
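The distinction being drawn here can be checked with a quick sketch (Python, purely for illustration): for a single-byte set like ISO 8859-1 the "code point" is just the encoded byte, which encode/decode already gives you, while for Unicode the code point and the UTF-8 bytes genuinely differ.

```python
# ISO 8859-1 row of the table above: the byte 0xA5 (¥) *is* the code point,
# so a to_code_point() function would add nothing over encode/decode.
latin_byte = "¥".encode("latin-1")
assert latin_byte == b"\xa5"
assert ord("¥") == 0xA5                      # same number either way

# ISO 10646 row U+29E3D: the code point and its UTF-8 encoding differ,
# which is the one case encode/decode cannot express.
assert ord("\U00029E3D") == 0x29E3D
assert "\U00029E3D".encode("utf-8") == b"\xf0\xa9\xb8\xbd"
```

That difference is why extending chr/ascii is only interesting for the Unicode case.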
Re: [HACKERS] Bug in UTF8-Validation Code?
Tatsuo Ishii wrote:
> I think we need to continue design discussion, probably targeting 8.4, not 8.3.

But isn't a simple fix for chr() and ascii(), which does not require a redesign, a Good Thing for 8.3 if possible? Something that maintains as much upward and/or Oracle compatibility as possible while doing away with ascii('€') returning 226 in UTF-8?

And I think - correct me if I am wrong - that conversion between character and integer representation of the character in the current database encoding is exactly that.

I see Tom Lane's point in rejecting chr(0), though.

Yours,
Laurenz Albe
Re: [HACKERS] Bug in UTF8-Validation Code?
On 2007-04-05, Tatsuo Ishii <[EMAIL PROTECTED]> wrote:
>> Andrew - Supernews <[EMAIL PROTECTED]> writes:
>>> Thinking about this made me realize that there's another, ahem, elephant in the room here: convert().
>>> By definition convert() returns text strings which are not valid in the server encoding. How can this be addressed?
>>
>> Remove convert(). Or at least redefine it as dealing in bytea not text.
>
> That would break some important use cases.
>
> 1) A user has a UTF-8 database which contains various language data. Each language has its own table. He wants to sort a SELECT result by using ORDER BY. Since locale cannot handle multiple languages, he uses C locale and does the SELECT something like this:
>
>    SELECT * FROM french_table ORDER BY convert(t, 'LATIN1');
>    SELECT * FROM japanese_table ORDER BY convert(t, 'EUC_JP');

That works without change if convert(text,text) returns bytea.

> 2) A user has a UTF-8 database but unfortunately his OS's UTF-8 locale is broken. He decided to use C locale and wants to sort the result from SELECT like this.
>
>    SELECT * FROM japanese_table ORDER BY convert(t, 'EUC_JP');

That also works without change if convert(text,text) returns bytea.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
Re: [HACKERS] Bug in UTF8-Validation Code?
> Andrew - Supernews <[EMAIL PROTECTED]> writes:
>> Thinking about this made me realize that there's another, ahem, elephant in the room here: convert().
>> By definition convert() returns text strings which are not valid in the server encoding. How can this be addressed?
>
> Remove convert(). Or at least redefine it as dealing in bytea not text.

That would break some important use cases.

1) A user has a UTF-8 database which contains various language data. Each language has its own table. He wants to sort a SELECT result by using ORDER BY. Since locale cannot handle multiple languages, he uses C locale and does the SELECT something like this:

   SELECT * FROM french_table ORDER BY convert(t, 'LATIN1');
   SELECT * FROM japanese_table ORDER BY convert(t, 'EUC_JP');

2) A user has a UTF-8 database but unfortunately his OS's UTF-8 locale is broken. He decided to use C locale and wants to sort the result from SELECT like this.

   SELECT * FROM japanese_table ORDER BY convert(t, 'EUC_JP');

Note that sorting by UTF-8 physical order would produce random results. So the following would not help him in this case:

   SELECT * FROM japanese_table ORDER BY t;

Also I don't understand how this is different from the problem when we have a message catalogue which does not match the encoding.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
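The EUC_JP ordering point can be demonstrated with a quick sketch (Python, purely for illustration): sorting by the EUC_JP byte representation, which is what ORDER BY convert(t, 'EUC_JP') amounts to under the C locale, gives JIS table order, and that differs from raw UTF-8 byte order for kanji.

```python
# 逢 (U+9022) comes before 悪 (U+60AA) in the JIS X 0208 table
# (EUC_JP bytes 0xB0A9 vs. 0xB0AD), but after it in Unicode order.
words = ["悪", "逢"]
by_euc  = sorted(words, key=lambda s: s.encode("euc_jp"))
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
assert by_euc  == ["逢", "悪"]   # JIS order, what the user wants
assert by_utf8 == ["悪", "逢"]   # Unicode/UTF-8 byte order differs
```

Since the sort key is just a byte string compared bytewise, convert() returning bytea instead of text preserves this use case, as Andrew argues above.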
Re: [HACKERS] Bug in UTF8-Validation Code?
> Tatsuo Ishii wrote:
>> BTW, every encoding has its own charset. However the relationship between encoding and charset is not as simple as in Unicode. For example, the encoding EUC_JP corresponds to multiple charsets, namely ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which returns a "code point" is not quite useful since it lacks the charset info. I think we need to continue design discussion, probably targeting 8.4, not 8.3.
>
> Is Unicode complete as far as Japanese chars go? I mean, is there a character in EUC_JP that is not representable in Unicode?

I don't think Unicode is "complete" in this case. Problems are: EUC_JP allows user defined characters which are not mapped to Unicode. Also some characters in EUC_JP correspond to multiple Unicode points.

> Because if Unicode is complete, ISTM it makes perfect sense to have a unicode_char() (or whatever we end up calling it) that takes a Unicode code point and returns a character in whatever JIS set you want (specified by setting client_encoding to that). Because then you solved the problem nicely.

I'm not sure what kind of use case for unicode_char() you are thinking about. Anyway if you want a "code point" from a character, we could easily add such functions to all backend encodings we currently support. Probably it would look like:

   to_code_point(str TEXT) returns TEXT

Example outputs are:

   ASCII - 41
   ISO 10646 - U+0041
   ISO 10646 - U+29E3D
   ISO 8859-1 - a5
   JIS X 0208 - 4141

It's a little bit too late for 8.2 though.

> One thing that I find confusing in your text above is whether EUC_JP is an encoding or a charset? I would think that the various JIS X are encodings, and EUC_JP is the charset; or is it the other way around?

No, EUC_JP is an encoding. The JIS X are the charsets.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
Re: [HACKERS] Bug in UTF8-Validation Code?
> Tatsuo Ishii wrote:
>> [...] I think we need to continue design discussion, probably targeting 8.4, not 8.3.
>
> The discussion came about because Andrew - Supernews noticed that chr() returns invalid utf8, and we're trying to fix all the bugs with invalid utf8 in the system. Something needs to be done, even if we just check the result of the current chr() implementation and throw an error on invalid results. But do we want to make this minor change for 8.3 and then change it again for 8.4?

My opinion was in the part you snipped from my previous mail: limiting chr() to the ASCII range.

--
Tatsuo Ishii
SRA OSS, Inc. Japan

> Here's an example of the current problem. It's an 8.2.3 database with utf8.en_US encoding:
>
> mark=# create table testutf8 (t text);
> CREATE TABLE
> mark=# insert into testutf8 (t) (select chr(gs) from generate_series(0,255) as gs);
> INSERT 0 256
> mark=# \copy testutf8 to testutf8.data
> mark=# truncate testutf8;
> TRUNCATE TABLE
> mark=# \copy testutf8 from testutf8.data
> ERROR:  invalid byte sequence for encoding "UTF8": 0x80
> HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
> CONTEXT:  COPY testutf8, line 129
Re: [HACKERS] Bug in UTF8-Validation Code?
Mark Dilger <[EMAIL PROTECTED]> writes:
> Albe Laurenz wrote:
>> 0x00 is a valid UNICODE code point and also a valid UTF-8 character!
>
> It's not my code that rejects this. I'm passing the resultant string to the pg_verify_mbstr(...) function and it is rejecting a null. I could change that, of course, but if other parts of the system found it reasonable to reject null, why should chr() be different?

chr() really should reject that if it's going to do any error checking at all, because text operations tend to misbehave on embedded nulls.

regards, tom lane
Re: [HACKERS] Bug in UTF8-Validation Code?
Tatsuo Ishii wrote:
> [...] I think we need to continue design discussion, probably targeting 8.4, not 8.3.

The discussion came about because Andrew - Supernews noticed that chr() returns invalid utf8, and we're trying to fix all the bugs with invalid utf8 in the system. Something needs to be done, even if we just check the result of the current chr() implementation and throw an error on invalid results. But do we want to make this minor change for 8.3 and then change it again for 8.4?

Here's an example of the current problem. It's an 8.2.3 database with utf8.en_US encoding:

mark=# create table testutf8 (t text);
CREATE TABLE
mark=# insert into testutf8 (t) (select chr(gs) from generate_series(0,255) as gs);
INSERT 0 256
mark=# \copy testutf8 to testutf8.data
mark=# truncate testutf8;
TRUNCATE TABLE
mark=# \copy testutf8 from testutf8.data
ERROR:  invalid byte sequence for encoding "UTF8": 0x80
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT:  COPY testutf8, line 129
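Why the reload fails can be shown in a few lines (Python, purely for illustration): the old chr() behavior stores a bare high byte for arguments 128-255, and no single byte in that range is a complete UTF-8 sequence, so COPY's validation rejects the file.

```python
def decodes_as_utf8(b: int) -> bool:
    """True iff the single byte b is by itself valid UTF-8."""
    try:
        bytes([b]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

invalid = [b for b in range(0x80, 0x100) if not decodes_as_utf8(b)]
assert len(invalid) == 128      # every high byte is invalid on its own
assert invalid[0] == 0x80       # the exact byte in the COPY error above
```

Line 129 of the dump is chr(128), i.e. byte 0x80, which is why that is where COPY stops.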
Re: [HACKERS] Bug in UTF8-Validation Code?
Albe Laurenz wrote:
> There's one thing that strikes me as weird in your implementation:
>
> pgsql=# select chr(0);
> ERROR: character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8"
>
> 0x00 is a valid UNICODE code point and also a valid UTF-8 character!

It's not my code that rejects this. I'm passing the resultant string to the pg_verify_mbstr(...) function and it is rejecting a null. I could change that, of course, but if other parts of the system found it reasonable to reject null, why should chr() be different?

This was also one of the questions I asked upthread. Does changing the behavior of chr() break people's reasonable expectations?

mark
Re: [HACKERS] Bug in UTF8-Validation Code?
On Wednesday, 4 April 2007 16:22, Tom Lane wrote:
> Alvaro Herrera <[EMAIL PROTECTED]> writes:
>> Right -- IMHO what we should be doing is reject any input to chr() which is beyond plain ASCII (or maybe > 255), and create a separate function (unicode_char() sounds good) to get a Unicode character from a code point, converted to the local client_encoding per conversion_procs.
>
> Hm, I hadn't thought of that approach, but another idea is that the argument of chr() is *always* a unicode code point, and it converts to the current encoding. Do we really need a separate function?

The SQL standard has a "Unicode character string literal", which looks like this:

   U&'The price is 100 \20AC.'

This is similar in spirit to our current escape mechanism available via E'...' which, however, produces bytes. It has the advantage over a chr()-based mechanism that the composition of strings doesn't require an ugly chain of literals, functions, and concatenations.

Implementing this would, however, be a bit tricky because you don't have access to the encoding conversion functions in the lexer. You would probably have to map that to a function call and evaluate it later.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
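A rough sketch of what expanding such a literal involves (Python, purely for illustration, not the lexer change Peter describes; only the default \XXXX and \+XXXXXX escapes are handled, and UESCAPE / escaped backslashes are ignored):

```python
import re

def expand_unicode_literal(body: str) -> str:
    """Expand SQL-standard U&'...' escapes with the default escape character:
    \XXXX is a 4-hex-digit code point, \+XXXXXX a 6-hex-digit one (sketch)."""
    def repl(m: "re.Match") -> str:
        return chr(int(m.group(1) or m.group(2), 16))
    return re.sub(r"\\\+([0-9A-Fa-f]{6})|\\([0-9A-Fa-f]{4})", repl, body)

assert expand_unicode_literal(r"The price is 100 \20AC.") == "The price is 100 €."
assert expand_unicode_literal(r"\+029E3D") == "\U00029E3D"   # 6-digit form
```

The tricky part in PostgreSQL is not this expansion itself but, as noted above, that the resulting characters still have to be converted to the database encoding, which the lexer cannot do.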
Re: [HACKERS] Bug in UTF8-Validation Code?
Tatsuo Ishii wrote:
> BTW, every encoding has its own charset. However the relationship between encoding and charset is not as simple as in Unicode. For example, the encoding EUC_JP corresponds to multiple charsets, namely ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which returns a "code point" is not quite useful since it lacks the charset info. I think we need to continue design discussion, probably targeting 8.4, not 8.3.

Is Unicode complete as far as Japanese chars go? I mean, is there a character in EUC_JP that is not representable in Unicode?

Because if Unicode is complete, ISTM it makes perfect sense to have a unicode_char() (or whatever we end up calling it) that takes a Unicode code point and returns a character in whatever JIS set you want (specified by setting client_encoding to that). Because then you solved the problem nicely.

One thing that I find confusing in your text above is whether EUC_JP is an encoding or a charset? I would think that the various JIS X are encodings, and EUC_JP is the charset; or is it the other way around?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Bug in UTF8-Validation Code?
> Alvaro Herrera <[EMAIL PROTECTED]> writes:
>> Right -- IMHO what we should be doing is reject any input to chr() which is beyond plain ASCII (or maybe > 255), and create a separate function (unicode_char() sounds good) to get a Unicode character from a code point, converted to the local client_encoding per conversion_procs.
>
> Hm, I hadn't thought of that approach, but another idea is that the argument of chr() is *always* a unicode code point, and it converts to the current encoding. Do we really need a separate function?

To be honest, I don't really see why we need to rush to add such Unicode (I assume we are referring to "Unicode" as ISO 10646) specialized functions at this point. Limiting chr() to the ASCII range is enough, I think.

BTW, every encoding has its own charset. However the relationship between encoding and charset is not as simple as in Unicode. For example, the encoding EUC_JP corresponds to multiple charsets, namely ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which returns a "code point" is not quite useful since it lacks the charset info. I think we need to continue design discussion, probably targeting 8.4, not 8.3.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
Re: [HACKERS] Bug in UTF8-Validation Code?
On Wed, Apr 04, 2007 at 10:22:28AM -0400, Tom Lane wrote:
> Alvaro Herrera <[EMAIL PROTECTED]> writes:
>> Right -- IMHO what we should be doing is reject any input to chr() which is beyond plain ASCII (or maybe > 255), and create a separate function (unicode_char() sounds good) to get a Unicode character from a code point, converted to the local client_encoding per conversion_procs.
>
> Hm, I hadn't thought of that approach, but another idea is that the argument of chr() is *always* a unicode code point, and it converts to the current encoding. Do we really need a separate function?

That's what I'd advocate, but then we're not Oracle compatible...

Have a nice day,
--
Martijn van Oosterhout http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Re: [HACKERS] Bug in UTF8-Validation Code?
Andrew - Supernews <[EMAIL PROTECTED]> writes:
> Thinking about this made me realize that there's another, ahem, elephant in the room here: convert().
> By definition convert() returns text strings which are not valid in the server encoding. How can this be addressed?

Remove convert(). Or at least redefine it as dealing in bytea not text.

regards, tom lane
Re: [HACKERS] Bug in UTF8-Validation Code?
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> Right -- IMHO what we should be doing is reject any input to chr() which is beyond plain ASCII (or maybe > 255), and create a separate function (unicode_char() sounds good) to get a Unicode character from a code point, converted to the local client_encoding per conversion_procs.

Hm, I hadn't thought of that approach, but another idea is that the argument of chr() is *always* a unicode code point, and it converts to the current encoding. Do we really need a separate function?

regards, tom lane
Re: [HACKERS] Bug in UTF8-Validation Code?
On 2007-04-04, Alvaro Herrera <[EMAIL PROTECTED]> wrote:
> Right -- IMHO what we should be doing is reject any input to chr() which is beyond plain ASCII (or maybe > 255), and create a separate function (unicode_char() sounds good) to get a Unicode character from a code point, converted to the local client_encoding per conversion_procs.

Thinking about this made me realize that there's another, ahem, elephant in the room here: convert(). By definition convert() returns text strings which are not valid in the server encoding. How can this be addressed?

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
Re: [HACKERS] Bug in UTF8-Validation Code?
Martijn van Oosterhout wrote:
> On Tue, Apr 03, 2007 at 01:06:38PM -0400, Tom Lane wrote:
>> I think it's probably defensible for non-Unicode encodings. To do otherwise would require (a) figuring out what the equivalent concept to "code point" is for each encoding, and (b) having a separate code path for each encoding to perform the mapping. It's not clear that there even is an answer to (a), and (b) seems like more work than chr() is worth. But we know what the right way is for Unicode, so we should special case that one.
>
> I dunno. I find it odd that if I want a pl/pgsql function to return a Euro symbol, it has to know what encoding the DB is in. Though I suppose that would call for a unicode_chr() function.

Right -- IMHO what we should be doing is reject any input to chr() which is beyond plain ASCII (or maybe > 255), and create a separate function (unicode_char() sounds good) to get a Unicode character from a code point, converted to the local client_encoding per conversion_procs.

So if I'm in Latin-1 and ask for the Euro sign, this should fail because Latin-1 does not have the euro sign. If I'm in Latin-9 I should get the Euro.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
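The proposed semantics can be sketched as follows (Python, purely for illustration; the function name and signature are hypothetical, echoing the unicode_char() suggestion above):

```python
def unicode_char(codepoint: int, client_encoding: str) -> bytes:
    """Sketch of the proposed unicode_char(): take a Unicode code point and
    return it encoded for the client, failing if the target character set
    has no such character."""
    return chr(codepoint).encode(client_encoding)  # raises if unmappable

# Latin-9 (ISO 8859-15) has the Euro sign, at byte 0xA4:
assert unicode_char(0x20AC, "iso-8859-15") == b"\xa4"

# Latin-1 does not have it, so the call should fail:
try:
    unicode_char(0x20AC, "iso-8859-1")
    assert False, "should have failed"
except UnicodeEncodeError:
    pass
```

This matches the behavior described above: the Euro succeeds under Latin-9 and errors out under Latin-1.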
Re: [HACKERS] Bug in UTF8-Validation Code?
>> When the database uses a single byte encoding, the chr function takes the binary byte representation as an integer number between 0 and 255 (e.g. ascii code).
>> When the database encoding is one of the unicode encodings it takes a unicode code point.
>> This is also what Oracle does.
>
> Sorry, but this is *NOT* what Oracle does.
> At least if we can agree that the code point for the Euro sign is 0x20AC.

yes

> SQL> SELECT ASCII('€') AS DEC,
>   2         TO_CHAR(ASCII('€'), 'XX') AS HEX
>   3  FROM DUAL;
>
>      DEC HEX
> -------- -------
> 14844588  E282AC
>
> The encoding in this example is AL32UTF8, which corresponds to our UTF8.

You are right, I am sorry. My test was broken.

To get the euro symbol in Oracle with an AL32UTF8 encoding you use chr(14844588)

Andreas
Re: [HACKERS] Bug in UTF8-Validation Code?
> When the database uses a single byte encoding, the chr function takes the binary byte representation as an integer number between 0 and 255 (e.g. ascii code).
> When the database encoding is one of the unicode encodings it takes a unicode code point.
> This is also what Oracle does.

Sorry, but this is *NOT* what Oracle does. At least if we can agree that the code point for the Euro sign is 0x20AC.

SQL> SELECT ASCII('€') AS DEC,
  2         TO_CHAR(ASCII('€'), 'XX') AS HEX
  3  FROM DUAL;

     DEC HEX
-------- -------
14844588  E282AC

The encoding in this example is AL32UTF8, which corresponds to our UTF8.

Yours,
Laurenz Albe
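To make the arithmetic behind Oracle's result explicit (Python, purely for illustration): the 14844588 is the UTF-8 byte sequence of the Euro sign read as a big-endian integer, not the Unicode code point.

```python
# ASCII('€') in an AL32UTF8 database returns the character's UTF-8 bytes
# interpreted as an integer, which is why it is not 0x20AC.
euro = "€".encode("utf-8")
assert euro == b"\xe2\x82\xac"                    # the HEX column above
assert int.from_bytes(euro, "big") == 14844588    # the DEC column above
assert ord("€") == 0x20AC                         # the code point is different
```

So Oracle's chr()/ASCII() pair works on the encoded-byte integer, which is the behavior under discussion for PostgreSQL's chr().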
Re: [HACKERS] Bug in UTF8-Validation Code?
> What do others think? Should the argument to CHR() be a Unicode code point or the numeric representation of the database encoding?

When the database uses a single byte encoding, the chr function takes the binary byte representation as an integer number between 0 and 255 (e.g. ascii code). When the database encoding is one of the unicode encodings it takes a unicode code point. This is also what Oracle does.

Not sure what to do with other multibyte encodings. Oracle only states that the numeric argument must resolve to one entire code point, whatever that is.

Andreas
Re: [HACKERS] Bug in UTF8-Validation Code?
Mark Dilger wrote:
>> What I suggest (and what Oracle implements, and isn't CHR() and ASCII() partly for Oracle compatibility?) is that CHR() and ASCII() convert between a character (in database encoding) and that database encoding in numeric form.
>
> Looking at Oracle documentation, it appears that you get different behavior from CHR(X [USING NCHAR_CS]) depending on whether you call it with the argument USING NCHAR_CS. Oracle 9i and higher have an additional function called NCHR(X) which is supposed to be the same as CHR(X USING NCHAR_CS).
>
> On http://www.oraclehome.co.uk/chr-function.htm it says that "To use UTF8, you specify using nchar_cs in the argument list". Does this mean that CHR(X) behaves as Tom Lane wants, and NCHR(X) behaves as Albe Laurenz wants? Vice versa?

That web page is misleading at least, if not downright wrong.

It's just that an Oracle database has 2 character sets, a "database character set" and a "national character set", the latter always being a UNICODE encoding (the name "national character set" is somewhat misleading). This baroque concept is from those days when nobody had a UNICODE database, but people still wanted to store characters not supported by the "database character set" - in that case you could define a column to be in the "national character set".

CHR(n) and CHR(n USING NCHAR_CS) = NCHR(n) are the same function, only that the first one uses the "database character set" and the latter ones the "national character set". Nowadays this Oracle concept of "national character set" is nearly obsolete; one normally uses a UNICODE "database character set".

Oracle has two things to say about CHR():

"For single-byte character sets, if n > 256, then Oracle Database returns the binary equivalent of n mod 256. For multibyte character sets, n must resolve to one entire code point. Invalid code points are not validated, and the result of specifying invalid code points is indeterminate."

It seems that Oracle means "encoding" when it says "code point" :^) We should of course reject invalid arguments! I don't know if I like this modulus thing for single byte encodings or not...

"Use of the CHR function (either with or without the optional USING NCHAR_CS clause) results in code that is not portable between ASCII- and EBCDIC-based machine architectures."

There's one thing that strikes me as weird in your implementation:

> pgsql=# select chr(0);
> ERROR: character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8"

0x00 is a valid UNICODE code point and also a valid UTF-8 character!

To me (maybe only to me) CHR() and ASCII() have always had the look and feel of "type casts" between "char" and integer, with all the lack of portability this might imply.

Yours,
Laurenz Albe
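The quoted mod-256 rule for single-byte character sets can be sketched like this (Python, purely for illustration; the helper name and the choice of LATIN9 are assumptions, not Oracle or PostgreSQL code):

```python
def oracle_chr_single_byte(n: int, encoding: str = "iso-8859-15") -> str:
    """Sketch of Oracle's documented CHR() rule for single-byte character
    sets: for out-of-range n, the binary equivalent of n mod 256 is used."""
    return bytes([n % 256]).decode(encoding)

assert oracle_chr_single_byte(65) == "A"
assert oracle_chr_single_byte(65 + 256) == "A"    # mod-256 wraparound
# 14844588 % 256 == 0xAC, so in LATIN9 the Euro integer degrades to '¬':
assert oracle_chr_single_byte(14844588) == "¬"
```

The last line is the very case discussed earlier in the thread: chr(14844588) in a LATIN9 database yielding the 'not' sign under mod-256 semantics.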
Re: [HACKERS] Bug in UTF8-Validation Code?
On Tue, Apr 03, 2007 at 01:06:38PM -0400, Tom Lane wrote:
> I think it's probably defensible for non-Unicode encodings. To do otherwise would require (a) figuring out what the equivalent concept to "code point" is for each encoding, and (b) having a separate code path for each encoding to perform the mapping. It's not clear that there even is an answer to (a), and (b) seems like more work than chr() is worth. But we know what the right way is for Unicode, so we should special case that one.

I dunno. I find it odd that if I want a pl/pgsql function to return a Euro symbol, it has to know what encoding the DB is in. Though I suppose that would call for a unicode_chr() function.

Is there any multibyte mapping other than unicode that distinguishes between the character set and the encoding thereof?

Have a nice day,
--
Martijn van Oosterhout http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Re: [HACKERS] Bug in UTF8-Validation Code?
Albe Laurenz wrote:
> What I suggest (and what Oracle implements, and isn't CHR() and ASCII() partly for Oracle compatibility?) is that CHR() and ASCII() convert between a character (in database encoding) and that database encoding in numeric form.

Looking at Oracle documentation, it appears that you get different behavior from CHR(X [USING NCHAR_CS]) depending on whether you call it with the argument USING NCHAR_CS. Oracle 9i and higher have an additional function called NCHR(X) which is supposed to be the same as CHR(X USING NCHAR_CS).

On http://www.oraclehome.co.uk/chr-function.htm it says that "To use UTF8, you specify using nchar_cs in the argument list". Does this mean that CHR(X) behaves as Tom Lane wants, and NCHR(X) behaves as Albe Laurenz wants? Vice versa?

I'm not saying that Oracle compatibility is paramount. But if we can get compatibility and a reasonable implementation at the same time, that seems like a bonus. Once again, I don't have Oracle installed and cannot test this :(

mark
Re: [HACKERS] Bug in UTF8-Validation Code?
Mark Dilger <[EMAIL PROTECTED]> writes: > Martijn van Oosterhout wrote: >> Just about every multibyte encoding other than Unicode has the problem >> of not distinguishing between the code point and the encoding of it. > Thanks for the feedback. Would you say that the way I implemented things in the example code would be correct for multibyte non-Unicode encodings? I think it's probably defensible for non-Unicode encodings. To do otherwise would require (a) figuring out what the equivalent concept to "code point" is for each encoding, and (b) having a separate code path for each encoding to perform the mapping. It's not clear that there even is an answer to (a), and (b) seems like more work than chr() is worth. But we know what the right way is for Unicode, so we should special case that one. Note the points made that in all cases ascii() and chr() should be inverses, and that you shouldn't just fall back to the old behavior in SQL_ASCII encoding. (My vote for SQL_ASCII would be to reject values > 255.) regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bug in UTF8-Validation Code?
Martijn van Oosterhout wrote: On Tue, Apr 03, 2007 at 11:43:21AM +0200, Albe Laurenz wrote: IMHO this is the only good and intuitive way for CHR() and ASCII(). Hardly. The comment earlier about mbtowc was much closer to the mark. And wide characters are defined as Unicode points. Basically, CHR() takes a unicode point and returns that character in a string appropriately encoded. ASCII() does the reverse. Just about every multibyte encoding other than Unicode has the problem of not distinguishing between the code point and the encoding of it. Unicode is a collection of encodings based on the same set. Have a nice day, Thanks for the feedback. Would you say that the way I implemented things in the example code would be correct for multibyte non Unicode encodings? I don't see how to avoid the endianness issue for those encodings. mark ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bug in UTF8-Validation Code?
Andrew wrote: >> According to RFC 2279, the Euro, >> Unicode code point 0x20AC = 0010 0000 1010 1100, >> will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC. >> >> IMHO this is the only good and intuitive way for CHR() and ASCII(). > > It is beyond ludicrous for functions like chr() or ascii() to > convert a Euro sign to 0xE282AC rather than 0x20AC. "Intuitive"? There > is _NO SUCH THING_ as 0xE282AC as a representation of a Unicode character > - there is either the code point, 0x20AC (which is a _number_), or the > sequences of _bytes_ that represent that code point in various encodings, > of which the three-byte sequence 0xE2 0x82 0xAC is the one used in UTF-8. Yes, 0xE2 0x82 0xAC is the representation in UTF-8, and UTF-8 is the database encoding in use. > Functions like chr() and ascii() should be dealing with the _number_ of the > code point, not with its representation in transfer encodings. I think that we have a fundamental difference. As far as I know, the term "code point" is only used in Unicode and is the first column in the list http://www.unicode.org/Public/UNIDATA/UnicodeData.txt So, if I understand you correctly, you want CHR() and ASCII() to convert between characters (in the current database encoding) and Unicode code points (independent of database encoding). What I suggest (and what Oracle implements, and isn't CHR() and ASCII() partly for Oracle compatibility?) is that CHR() and ASCII() convert between a character (in database encoding) and that database encoding in numeric form. I think that what you suggest would be a useful function too, but I certainly wouldn't call such a function ASCII() :^) The current implementation seems closer to my idea of ASCII(), only incomplete:

test=> select to_hex(ascii('€'));
 to_hex
--------
 e2
(1 row)

What do others think? Should the argument to CHR() be a Unicode code point or the numeric representation of the database encoding? 
Yours, Laurenz Albe ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bug in UTF8-Validation Code?
On Tue, Apr 03, 2007 at 11:43:21AM +0200, Albe Laurenz wrote: > IMHO this is the only good and intuitive way for CHR() and ASCII(). Hardly. The comment earlier about mbtowc was much closer to the mark. And wide characters are defined as Unicode points. Basically, CHR() takes a unicode point and returns that character in a string appropriately encoded. ASCII() does the reverse. Just about every multibyte encoding other than Unicode has the problem of not distinguishing between the code point and the encoding of it. Unicode is a collection of encodings based on the same set. Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate.
Re: [HACKERS] Bug in UTF8-Validation Code?
On 2007-04-03, "Albe Laurenz" <[EMAIL PROTECTED]> wrote: > According to RFC 2279, the Euro, > Unicode code point 0x20AC = 0010 1010 1100, > will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC. > > IMHO this is the only good and intuitive way for CHR() and ASCII(). It is beyond ludicrous for functions like chr() or ascii() to convert a Euro sign to 0xE282AC rather than 0x20AC. "Intuitive"? There is _NO SUCH THING_ as 0xE282AC as a representation of a Unicode character - there is either the code point, 0x20AC (which is a _number_), or the sequences of _bytes_ that represent that code point in various encodings, of which the three-byte sequence 0xE2 0x82 0xAC is the one used in UTF-8. Functions like chr() and ascii() should be dealing with the _number_ of the code point, not with its representation in transfer encodings. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Bug in UTF8-Validation Code?
Mark Dilger wrote: >>> In particular, in UTF8 land I'd have expected the argument of chr() >>> to be interpreted as a Unicode code point, not as actual UTF8 bytes >>> with a randomly-chosen endianness. >>> >>> Not sure what to do in other multibyte encodings. >> >> "Not sure what to do in other multibyte encodings" was pretty much my >> rationale for this particular behavior. I standardized on network byte >> order because there are only two endiannesses to choose from, and the >> other seems to be a more surprising choice. > > Since chr() is defined in oracle_compat.c, I decided to look > at what Oracle might do. See > http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/functions18a.htm > > It looks to me like they are doing the same thing that I did, > though I don't have Oracle installed anywhere to verify that. > Is there a difference? This is Oracle 10.2.0.3.0 ("latest and greatest") with UTF-8 encoding (actually, Oracle chooses to call this encoding AL32UTF8):

SQL> SELECT ASCII('€') AS DEC,
  2         TO_CHAR(ASCII('€'), 'XXXXXX') AS HEX
  3    FROM DUAL;

     DEC HEX
-------- -------
14844588  E282AC

SQL> SELECT CHR(14844588) AS EURO FROM DUAL;

EURO
----
€

I don't see how endianness enters into this at all - isn't that just the question of how a byte is stored physically? According to RFC 2279, the Euro, Unicode code point 0x20AC = 0010 0000 1010 1100, will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC. IMHO this is the only good and intuitive way for CHR() and ASCII(). Yours, Laurenz Albe ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Bug in UTF8-Validation Code?
On 2007-04-02, Mark Dilger <[EMAIL PROTECTED]> wrote: > Here's the code for the new chr() function: > > if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c()) Clearly wrong - this allows returning invalid UTF8 data in locale C, which is not an uncommon setting to use. Treating the parameter as bytes is wrong too - it should correspond to whatever the natural character numbering for the encoding is; for utf8 that is the Unicode code point. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Bug in UTF8-Validation Code?
Mark Dilger wrote: Since chr() is defined in oracle_compat.c, I decided to look at what Oracle might do. See http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/functions18a.htm It looks to me like they are doing the same thing that I did, though I don't have Oracle installed anywhere to verify that. Is there a difference? Reading that page again, I think I'd have to use mbrtowc() or similar in the spot where I'm currently just using the literal utf8 string. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bug in UTF8-Validation Code?
Mark Dilger wrote: Tom Lane wrote: Mark Dilger <[EMAIL PROTECTED]> writes: pgsql=# select chr(14989485); chr - 中 (1 row) Is there a principled rationale for this particular behavior as opposed to any other? In particular, in UTF8 land I'd have expected the argument of chr() to be interpreted as a Unicode code point, not as actual UTF8 bytes with a randomly-chosen endianness. Not sure what to do in other multibyte encodings. "Not sure what to do in other multibyte encodings" was pretty much my rationale for this particular behavior. I standardized on network byte order because there are only two endiannesses to choose from, and the other seems to be a more surprising choice. I looked around on the web for a standard for how to convert an integer into a valid multibyte character and didn't find anything. Andrew, Supernews has said upthread that chr() is clearly wrong and needs to be fixed. If so, we need some clear definition of what "fixed" means. Any suggestions? mark Since chr() is defined in oracle_compat.c, I decided to look at what Oracle might do. See http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/functions18a.htm It looks to me like they are doing the same thing that I did, though I don't have Oracle installed anywhere to verify that. Is there a difference? mark ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Bug in UTF8-Validation Code?
Mark Dilger wrote: Tom Lane wrote: Mark Dilger <[EMAIL PROTECTED]> writes: pgsql=# select chr(14989485); chr - 中 (1 row) Is there a principled rationale for this particular behavior as opposed to any other? In particular, in UTF8 land I'd have expected the argument of chr() to be interpreted as a Unicode code point, not as actual UTF8 bytes with a randomly-chosen endianness. Not sure what to do in other multibyte encodings. "Not sure what to do in other multibyte encodings" was pretty much my rationale for this particular behavior. I standardized on network byte order because there are only two endiannesses to choose from, and the other seems to be a more surprising choice. I looked around on the web for a standard for how to convert an integer into a valid multibyte character and didn't find anything. Andrew, Supernews has said upthread that chr() is clearly wrong and needs to be fixed. If so, we need some clear definition of what "fixed" means. Any suggestions? mark Another issue to consider when thinking about the correct definition of chr() is that ascii(chr(X)) = X. This gets weird if X is greater than 255. If nothing else, the name "ascii" is no longer appropriate. mark ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Bug in UTF8-Validation Code?
Tom Lane wrote: Mark Dilger <[EMAIL PROTECTED]> writes: pgsql=# select chr(14989485); chr - 中 (1 row) Is there a principled rationale for this particular behavior as opposed to any other? In particular, in UTF8 land I'd have expected the argument of chr() to be interpreted as a Unicode code point, not as actual UTF8 bytes with a randomly-chosen endianness. Not sure what to do in other multibyte encodings. "Not sure what to do in other multibyte encodings" was pretty much my rationale for this particular behavior. I standardized on network byte order because there are only two endiannesses to choose from, and the other seems to be a more surprising choice. I looked around on the web for a standard for how to convert an integer into a valid multibyte character and didn't find anything. Andrew, Supernews has said upthread that chr() is clearly wrong and needs to be fixed. If so, we need some clear definition of what "fixed" means. Any suggestions? mark ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bug in UTF8-Validation Code?
Mark Dilger <[EMAIL PROTECTED]> writes: >> pgsql=# select chr(14989485); >> chr >> - >> 中 >> (1 row) Is there a principled rationale for this particular behavior as opposed to any other? In particular, in UTF8 land I'd have expected the argument of chr() to be interpreted as a Unicode code point, not as actual UTF8 bytes with a randomly-chosen endianness. Not sure what to do in other multibyte encodings. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bug in UTF8-Validation Code?
Mark Dilger wrote: Andrew - Supernews wrote: On 2007-04-01, Mark Dilger <[EMAIL PROTECTED]> wrote: Do any of the string functions (see http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the risk of generating invalid utf8 encoded strings? Do I need to add checks? Are there known bugs with these functions in this regard? The chr() function returns an octet, rather than a character; this is clearly wrong and needs fixing. Ok, I've altered the chr() function. I am including a transcript from psql below. There are several design concerns: 1) In the current implementation, chr(0) returns a 5-byte text object (4 bytes of overhead plus one byte of data) containing a null. In the new implementation, this returns an error. I don't know, but it is possible that people currently use things like "SELECT chr(0) || chr(0) || ..." to build up strings of nulls. 2) Under utf8, chr(X) fails for X = 128..255. This may also break current users' expectations. 3) The implicit modulus operation that was being performed by chr() is now gone, which might break some users. 4) You can't represent the high end of the astral plane with type INTEGER, unless you pass in a negative value, which is somewhat unintuitive. Since chr() expects an integer (and not a bigint) the user needs to handle the sign bit correctly. mark

Welcome to psql 8.3devel, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

pgsql=# select chr(0);
ERROR:  character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8"
pgsql=# select chr(65);
 chr
-----
 A
(1 row)

pgsql=# select chr(128);
ERROR:  character 0x80 of encoding "SQL_ASCII" has no equivalent in "UTF8"
pgsql=# select chr(53398);
 chr
-----
 Ж
(1 row)

pgsql=# select chr(14989485);
 chr
-----
 中
(1 row)

pgsql=# select chr(4036005254);
ERROR:  function chr(bigint) does not exist
LINE 1: select chr(4036005254);
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

Here's the code for the new chr() function:

Datum
chr(PG_FUNCTION_ARGS)
{
    int32       cvalue = PG_GETARG_INT32(0);
    text       *result;

    if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())
    {
        int         encoding,
                    len,
                    byteoff;
        uint32      buf[2];
        const char *bufptr;

        encoding = GetDatabaseEncoding();
        buf[0] = htonl(cvalue);
        buf[1] = 0;
        bufptr = (const char *) &buf;
        /* skip leading zero bytes of the network-byte-order value */
        for (byteoff = 0; byteoff < sizeof(uint32) && 0 == *bufptr;
             ++byteoff, ++bufptr)
            ;
        len = pg_encoding_mblen(encoding, bufptr);
        if (byteoff + len != sizeof(uint32) ||
            !pg_verify_mbstr(encoding, bufptr, len, true /* noError */ ))
            report_untranslatable_char(PG_SQL_ASCII, encoding,
                                       bufptr, sizeof(int32));
        result = (text *) palloc(VARHDRSZ + len);
        SET_VARSIZE(result, VARHDRSZ + len);
        memcpy(VARDATA(result), bufptr, len);
    }
    else
    {
        result = (text *) palloc(VARHDRSZ + 1);
        SET_VARSIZE(result, VARHDRSZ + 1);
        *VARDATA(result) = (char) cvalue;
    }

    PG_RETURN_TEXT_P(result);
}

---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bug in UTF8-Validation Code?
Andrew - Supernews wrote: On 2007-04-01, Mark Dilger <[EMAIL PROTECTED]> wrote: Do any of the string functions (see http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the risk of generating invalid utf8 encoded strings? Do I need to add checks? Are there known bugs with these functions in this regard? The chr() function returns an octet, rather than a character; this is clearly wrong and needs fixing. Ok, I've altered the chr() function. I am including a transcript from psql below. There are several design concerns: 1) In the current implementation, chr(0) returns a 5-byte text object (4 bytes of overhead plus one byte of data) containing a null. In the new implementation, this returns an error. I don't know, but it is possible that people currently use things like "SELECT chr(0) || chr(0) || ..." to build up strings of nulls. 2) Under utf8, chr(X) fails for X = 128..255. This may also break current users' expectations. 3) The implicit modulus operation that was being performed by chr() is now gone, which might break some users. 4) You can't represent the high end of the astral plane with type INTEGER, unless you pass in a negative value, which is somewhat unintuitive. Since chr() expects an integer (and not a bigint) the user needs to handle the sign bit correctly. mark

Welcome to psql 8.3devel, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

pgsql=# select chr(0);
ERROR:  character 0x00 of encoding "SQL_ASCII" has no equivalent in "UTF8"
pgsql=# select chr(65);
 chr
-----
 A
(1 row)

pgsql=# select chr(128);
ERROR:  character 0x80 of encoding "SQL_ASCII" has no equivalent in "UTF8"
pgsql=# select chr(53398);
 chr
-----
 Ж
(1 row)

pgsql=# select chr(14989485);
 chr
-----
 中
(1 row)

pgsql=# select chr(4036005254);
ERROR:  function chr(bigint) does not exist
LINE 1: select chr(4036005254);
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bug in UTF8-Validation Code?
Tatsuo Ishii <[EMAIL PROTECTED]> writes: >> No, we've never exported those with the intent that client code should >> use 'em. > I thought PQescapeString() of 8.3 uses mbverify functions to make sure > that user supplied multibyte string is valid. Certainly --- but we can change PQescapeString to match whatever we do with the pg_wchar functions. The question was whether we intend to support client application code (outside libpq) using those functions. That's definitely not the intent. exports.txt lists only PQmblen and pg_utf_mblen as exported (and I have to wonder why the latter is separately exported...), which means that client code on modern platforms isn't even capable of getting at the others. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bug in UTF8-Validation Code?
> Mark Dilger <[EMAIL PROTECTED]> writes: > > Refactoring the way these table driven functions work would impact > > lots of other code. Just grep for all files #including mb/pg_wchar.h > > for the list of them. The list includes interfaces/libpq, and I'm > > wondering if software that links against postgres might rely on these > > function prototypes? > > No, we've never exported those with the intent that client code should > use 'em. Doing so would require importing non-public headers, and > anyone who does that can have no grounds for complaining if the headers > change incompatibly. I thought PQescapeString() of 8.3 uses mbverify functions to make sure that user supplied multibyte string is valid. -- Tatsuo Ishii SRA OSS, Inc. Japan ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Bug in UTF8-Validation Code?
Mark Dilger <[EMAIL PROTECTED]> writes: > Refactoring the way these table driven functions work would impact > lots of other code. Just grep for all files #including mb/pg_wchar.h > for the list of them. The list includes interfaces/libpq, and I'm > wondering if software that links against postgres might rely on these > function prototypes? No, we've never exported those with the intent that client code should use 'em. Doing so would require importing non-public headers, and anyone who does that can have no grounds for complaining if the headers change incompatibly. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bug in UTF8-Validation Code?
Martijn van Oosterhout wrote: There's also the performance angle. The current mbverify is very inefficient for encodings like UTF-8. You might need to refactor a bit there... There appears to be a lot of function call overhead in the current implementation. In pg_verify_mbstr, the function pointer pg_wchar_table.mbverify is called for each multibyte character in a multibyte string. Refactoring the way these table driven functions work would impact lots of other code. Just grep for all files #including mb/pg_wchar.h for the list of them. The list includes interfaces/libpq, and I'm wondering if software that links against postgres might rely on these function prototypes? mark ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bug in UTF8-Validation Code?
On 2007-04-01, Mark Dilger <[EMAIL PROTECTED]> wrote: > Do any of the string functions (see > http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the > risk of generating invalid utf8 encoded strings? Do I need to add checks? > Are there known bugs with these functions in this regard? The chr() function returns an octet, rather than a character; this is clearly wrong and needs fixing. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bug in UTF8-Validation Code?
On Sat, Mar 31, 2007 at 07:47:21PM -0700, Mark Dilger wrote: > OK, I can take a stab at fixing this. I'd like to state some assumptions > so people can comment and reply: > > I assume that I need to fix *all* cases where invalid byte encodings get > into the database through functions shipped in the core distribution. Yes. > I assume I do not need to worry about people getting bad data into the > system through their own database extensions. That'd be rather difficult :) > I assume that the COPY problem discussed up-thread goes away once you > eliminate all the paths by which bad data can get into the system. > However, existing database installations with bad data already loaded will > not be magically fixed with these code patches. Correct. > Do any of the string functions (see > http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run > the risk of generating invalid utf8 encoded strings? Do I need to add > checks? Are there known bugs with these functions in this regard? I don't think so. They'd be bugs if they were... > If not, I assume I can add mbverify calls to the various input routines > (textin, varcharin, etc) where invalid utf8 could otherwise enter the > system. The only hard part is handling where the escaping and unescaping is happening... > I assume that this work can be limited to HEAD and that I don't need to > back-patch it. (I suspect this assumption is a contentious one.) At the very least I'd start with HEAD. Whether it gets backpatched probably depends on how invasive it ends up being... There's also the performance angle. The current mbverify is very inefficient for encodings like UTF-8. You might need to refactor a bit there... Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate.
Re: [HACKERS] Bug in UTF8-Validation Code?
Mark Dilger wrote: Bruce Momjian wrote: Added to TODO: * Fix cases where invalid byte encodings are accepted by the database, but throw an error on SELECT http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php Is anyone working on fixing this bug? Hi, has anyone volunteered to fix this bug? I did not see any reply on the mailing list to your question above. mark OK, I can take a stab at fixing this. I'd like to state some assumptions so people can comment and reply: I assume that I need to fix *all* cases where invalid byte encodings get into the database through functions shipped in the core distribution. I assume I do not need to worry about people getting bad data into the system through their own database extensions. I assume that the COPY problem discussed up-thread goes away once you eliminate all the paths by which bad data can get into the system. However, existing database installations with bad data already loaded will not be magically fixed with these code patches. Do any of the string functions (see http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the risk of generating invalid utf8 encoded strings? Do I need to add checks? Are there known bugs with these functions in this regard? If not, I assume I can add mbverify calls to the various input routines (textin, varcharin, etc) where invalid utf8 could otherwise enter the system. I assume that this work can be limited to HEAD and that I don't need to back-patch it. (I suspect this assumption is a contentious one.) Advice and comments are welcome, mark ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Bug in UTF8-Validation Code?
Bruce Momjian wrote: Added to TODO: * Fix cases where invalid byte encodings are accepted by the database, but throw an error on SELECT http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php Is anyone working on fixing this bug? Hi, has anyone volunteered to fix this bug? I did not see any reply on the mailing list to your question above. mark ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Bug in UTF8-Validation Code?
Added to TODO: * Fix cases where invalid byte encodings are accepted by the database, but throw an error on SELECT http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php Is anyone working on fixing this bug? --- Mario Weilguni wrote:

> Hi,
>
> I've a problem with a database: I can dump the database to a file, but
> restoration fails. Happens with 8.1.4.
>
> Steps to reproduce:
> create database testdb with encoding='UTF8';
> \c testdb
> create table test(x text);
> insert into test values ('\244');  ==> Is accepted, even if not UTF8.
>
> pg_dump testdb -f testdb.dump -Fc
> pg_restore -f testdb.dump -d testdb  => fails with an error:
> ERROR: invalid byte sequence for encoding "UTF8": 0xa4
>
> The problem itself comes from a CSV file, which is imported with \copy
> without proper quoting (so I have to fix this anyway), but I still think
> this is an error, making restoration very complicated in such cases...
>
> Or am I doing something completely wrong here?
>
> Best regards,
> Mario Weilguni

-- Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Bug in UTF8-Validation Code?
On Sunday, 18 March 2007 at 12:36, Martijn van Oosterhout wrote: > > It seems to me that the easiest solution would be to forbid \x?? escape > sequences where it's greater than \x7F for UTF-8 server encodings. > Instead introduce a \u escape for specifying the unicode character > directly. Under the basic principle that any escape sequence still has > to represent a single character. The result can be multiple bytes, but > you don't have to check for consistency anymore. Wouldn't the better solution be to change the order of validation and convert_backslash_sequences? First convert the sequences, then reject invalid bytes in the validation stage. Regards Mario Weilguni ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Bug in UTF8-Validation Code?
Jeff Davis <[EMAIL PROTECTED]> wrote: > Some people think it's a bug, some people don't. It is technically > documented behavior, but I don't think the documentation is clear > enough. I think it is a bug that should be fixed, and here's another > message in the thread that expresses my opinion: Agreed. I think it is a bug, too. Insertion of invalid characters breaks subsequent read queries:

$ initdb --encoding=utf8

# CREATE TABLE test (t text);
# INSERT INTO test VALUES('A');
# SELECT * FROM test;
 t
---
 A
(1 row)

# INSERT INTO test VALUES(E'\200');
# SELECT * FROM test;
ERROR:  invalid byte sequence for encoding "UTF8": 0x80
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

Could it lead to DoS? http://www.postgresql.org/support/security | [D] A vulnerability that is exploitable for denial-of-service, | but requiring a valid prior login. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Bug in UTF8-Validation Code?
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Ok, good point. Now, which of those need to have a check for valid encoding? The vast majority will barf on any non-ASCII character anyway ... only the ones that don't will need a check. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Bug in UTF8-Validation Code?
Gregory Stark wrote: "Andrew Dunstan" <[EMAIL PROTECTED]> writes: Below is a list of the input routines in the adt directory, courtesy of grep. Grep isn't a good way to get these, your list missed a bunch. postgres=# select distinct prosrc from pg_proc where oid in (select typinput from pg_type); [snip] (64 rows) Ok, good point. Now, which of those need to have a check for valid encoding? cheers andrew ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Bug in UTF8-Validation Code?
"Andrew Dunstan" <[EMAIL PROTECTED]> writes: > Below is a list of the input routines in the adt directory, courtesy of grep. Grep isn't a good way to get these, your list missed a bunch. postgres=# select distinct prosrc from pg_proc where oid in (select typinput from pg_type); prosrc - abstimein aclitemin any_in anyarray_in anyelement_in array_in bit_in boolin box_in bpcharin byteain cash_in charin cidin cidr_in circle_in cstring_in date_in domain_in float4in float8in inet_in int2in int2vectorin int4in int8in internal_in interval_in language_handler_in line_in lseg_in macaddr_in namein numeric_in oidin oidvectorin opaque_in path_in point_in poly_in record_in regclassin regoperatorin regoperin regprocedurein regprocin regtypein reltimein smgrin textin tidin time_in timestamp_in timestamptz_in timetz_in tintervalin trigger_in unknownin uuid_in varbit_in varcharin void_in xidin xml_in (64 rows) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bug in UTF8-Validation Code?
I wrote: The escape processing is actually done in the lexer in the case of literals. We have to allow for bytea literals there too, regardless of encoding. The lexer naturally has no notion of the intended destination of the literal, so we need to defer the validity check to the *in functions for encoding-aware types. And, as Tom has noted, COPY does its own escape processing, but does it before the transcoding. So ISTM that any solution other than something like I have proposed will probably involve substantial surgery.

Below is a list of the input routines in the adt directory, courtesy of grep. I'm thinking we will need to put checks in:

varcharin
bpcharin
textin
unknownin (?)
namein (?)

Any others?

cheers andrew

acl.c:aclitemin
bool.c:boolin
char.c:charin
date.c:timetypmodin
date.c:timetztypmodin
float.c:dasin
float.c:dsin
nabstime.c:abstimein
nabstime.c:reltimein
nabstime.c:tintervalin
name.c:namein
not_in.c:oidnotin
numeric.c:numerictypmodin
oid.c:oidin
oid.c:oidvectorin
regproc.c:regprocin
regproc.c:regprocedurein
regproc.c:regoperin
regproc.c:regoperatorin
regproc.c:regclassin
regproc.c:regtypein
tid.c:tidin
timestamp.c:timestamptypmodin
timestamp.c:timestamptztypmodin
timestamp.c:intervaltypmodin
varbit.c:bittypmodin
varbit.c:varbittypmodin
varchar.c:bpcharin
varchar.c:bpchartypmodin
varchar.c:varcharin
varchar.c:varchartypmodin
varlena.c:byteain
varlena.c:textin
varlena.c:unknownin
xid.c:xidin
xid.c:cidin

---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bug in UTF8-Validation Code?
On Sun, Mar 18, 2007 at 08:25:56AM -0400, Andrew Dunstan wrote: > It does also seem from my test results that transcoding to MB charsets > (or at least to utf-8) is surprisingly expensive, and that this would be > a good place to look at optimisation possibilities. The validity tests > can also be somewhat expensive. Hmm, I just noticed that the verify string works one character at a time, at least that part could be dramatically optimised. Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate.
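To illustrate the kind of optimisation Martijn is pointing at, here is a hedged standalone sketch (not PostgreSQL's actual pg_verifymbstr, simplified to UTF-8 only, and all function names hypothetical): instead of stepping one character at a time, the verifier skips runs of 7-bit ASCII eight bytes at a time with a single bitmask test, and only falls back to per-sequence checks when a high bit shows up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Validate one multibyte UTF-8 sequence starting at s (len bytes left).
 * Returns the sequence length, or 0 if invalid.  Checks sequence length,
 * continuation bytes, and the common overlong/surrogate/range cases. */
static int check_mb(const unsigned char *s, size_t len)
{
    int need, i;
    if (s[0] < 0xC2) return 0;          /* bare continuation byte or overlong 2-byte */
    else if (s[0] <= 0xDF) need = 2;
    else if (s[0] <= 0xEF) need = 3;
    else if (s[0] <= 0xF4) need = 4;
    else return 0;                      /* 0xF5-0xFF: beyond U+10FFFF */
    if (len < (size_t) need) return 0;  /* truncated sequence */
    for (i = 1; i < need; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;
    if (s[0] == 0xE0 && s[1] < 0xA0) return 0;  /* overlong 3-byte form */
    if (s[0] == 0xED && s[1] > 0x9F) return 0;  /* UTF-16 surrogate range */
    if (s[0] == 0xF0 && s[1] < 0x90) return 0;  /* overlong 4-byte form */
    if (s[0] == 0xF4 && s[1] > 0x8F) return 0;  /* beyond U+10FFFF */
    return need;
}

/* Verify a whole string; returns 1 if valid, 0 if not.  The inner loop
 * skips pure-ASCII runs eight bytes at a time via one bitmask test. */
int utf8_verify(const char *str, size_t len)
{
    const unsigned char *s = (const unsigned char *) str;
    size_t i = 0;
    while (i < len)
    {
        /* fast path: any high bit in the next 8 bytes? */
        while (i + 8 <= len)
        {
            uint64_t chunk;
            memcpy(&chunk, s + i, 8);
            if (chunk & UINT64_C(0x8080808080808080))
                break;
            i += 8;
        }
        if (i >= len) break;
        if (s[i] < 0x80)
            i++;                        /* single ASCII byte */
        else
        {
            int n = check_mb(s + i, len - i);
            if (n == 0) return 0;       /* invalid sequence */
            i += n;
        }
    }
    return 1;
}
```

On mostly-ASCII input (the common case in Andrew's and Tom's COPY benchmarks) the chunk test dominates, which is where the per-character loop loses most of its time.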
Re: [HACKERS] Bug in UTF8-Validation Code?
Martijn van Oosterhout wrote: On Sat, Mar 17, 2007 at 11:46:01AM -0400, Andrew Dunstan wrote: How can we fix this? Frankly, the statement in the docs warning about making sure that escaped sequences are valid in the server encoding is a cop-out. We don't accept invalid data elsewhere, and this should be no different IMNSHO. I don't see why this should be any different from, say, date or numeric data. For years people have sneered at MySQL because it accepted dates like Feb 31st, and rightly so. But this seems to me to be like our own version of the same problem. It seems to me that the easiest solution would be to forbid \x?? escape sequences where it's greater than \x7F for UTF-8 server encodings. Instead introduce a \u escape for specifying the unicode character directly. Under the basic principle that any escape sequence still has to represent a single character. The result can be multiple bytes, but you don't have to check for consistancy anymore. Have a nice day, The escape processing is actually done in the lexer in the case of literals. We have to allow for bytea literals there too, regardless of encoding. The lexer naturally has no notion of the intended destination of the literal, So we need to defer the validity check to the *in functions for encoding-aware types. And it as Tom has noted, COPY does its own escape processing but does it before the transcoding. So ISTM that any solution other than something like I have proposed will probably involve substantial surgery. It does also seem from my test results that transcoding to MB charsets (or at least to utf-8) is surprisingly expensive, and that this would be a good place to look at optimisation possibilities. The validity tests can also be somewhat expensive. But correctness matters most, IMNSHO. cheers andrew ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bug in UTF8-Validation Code?
On Sat, Mar 17, 2007 at 11:46:01AM -0400, Andrew Dunstan wrote: > How can we fix this? Frankly, the statement in the docs warning about > making sure that escaped sequences are valid in the server encoding is a > cop-out. We don't accept invalid data elsewhere, and this should be no > different IMNSHO. I don't see why this should be any different from, > say, date or numeric data. For years people have sneered at MySQL > because it accepted dates like Feb 31st, and rightly so. But this seems > to me to be like our own version of the same problem. It seems to me that the easiest solution would be to forbid \x?? escape sequences where it's greater than \x7F for UTF-8 server encodings. Instead introduce a \u escape for specifying the unicode character directly. Under the basic principle that any escape sequence still has to represent a single character. The result can be multiple bytes, but you don't have to check for consistency anymore. Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate.
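The codepoint-to-bytes step such a \u escape would need is small, and by construction it can only ever emit valid UTF-8, which is why no after-the-fact consistency check is required. A hedged sketch of that step (hypothetical helper, not actual PostgreSQL code, and not the escape syntax the server uses):

```c
#include <assert.h>
#include <string.h>

/* Encode one Unicode codepoint as UTF-8.  Writes up to 4 bytes into buf
 * and returns the byte count, or 0 for invalid codepoints (UTF-16
 * surrogates and anything above U+10FFFF).  Every non-zero result is a
 * well-formed UTF-8 sequence, which is the point of the \u proposal. */
int codepoint_to_utf8(unsigned cp, unsigned char *buf)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                               /* surrogate range */
    if (cp < 0x80) {
        buf[0] = (unsigned char) cp;            /* plain ASCII */
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char) (0xC0 | (cp >> 6));
        buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        buf[0] = (unsigned char) (0xE0 | (cp >> 12));
        buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        buf[0] = (unsigned char) (0xF0 | (cp >> 18));
        buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                   /* beyond U+10FFFF */
}
```

For a non-UTF8 server encoding the resulting bytes would still have to go through an encoding conversion, which is where the thread's disagreement about chr() semantics comes in.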
Re: [HACKERS] Bug in UTF8-Validation Code?
Maybe we should add a resource-intensive check to the ASCII encoding(s), that would even the score ;p Let's test MySQL on this and see how much worse it performs. -- Grzegorz 'the evil' Jaskiewicz evil C/evil C++ developer for hire ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bug in UTF8-Validation Code?
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: Here are some timing tests in 1m rows of random utf8 encoded 100 char data. It doesn't look to me like the saving you're suggesting is worth the trouble. Hmm ... not sure I believe your numbers. Using a test file of 1m lines of 100 random latin1 characters converted to utf8 (thus, about half and half 7-bit ASCII and 2-byte utf8 characters), I get this in SQL_ASCII encoding: regression=# \timing Timing is on. regression=# create temp table test(f1 text); CREATE TABLE Time: 5.047 ms regression=# copy test from '/home/tgl/zzz1m'; COPY 100 Time: 4337.089 ms and this in UTF8 encoding: utf8=# \timing Timing is on. utf8=# create temp table test(f1 text); CREATE TABLE Time: 5.108 ms utf8=# copy test from '/home/tgl/zzz1m'; COPY 100 Time: 7776.583 ms The numbers aren't super repeatable, but it sure looks to me like the encoding check adds at least 50% to the runtime in this example; so doing it twice seems unpleasant. (This is CVS HEAD, compiled without assert checking, on an x86_64 Fedora Core 6 box.) Here are some test results that are closer to yours. I used a temp table and had cassert off and fsync off, and tried with several encodings. The additional load from the test isn't 50%, (I think you have added the cost of going from ascii to utf8 to the cost of the test to get that 50%) but it is nevertheless appreciable. I agree that we should look at not testing if the client and server encodings are the same, so we can reduce the difference. 
cheers andrew

                 Run    SQL_ASCII     LATIN1       UTF8
 Without test      1      4659.38    4766.07    9134.53
                   2      7999.64    4003.13    6231.41
                   3      4178.46    6178.89    7266.39
                   4      4201.7     3930.84   10154.38
                   5      4092.44        .52    9438.24
                   6      3977.34    4197.09    8866.56
             Average      4851.49    4586.76    8515.25
 With test         1     11993.86   12625.8    10109.89
                   2      4647.16    9192.53   11251.27
                   3      4211.02    9903.77   10097.37
                   4      9203.62    7045.06   10372.25
                   5      4121.39    4138.78   10386.92
                   6      3722.73    4552.09    7432.56
             Average      6316.63    7909.67    9941.71

---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bug in UTF8-Validation Code?
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: Here are some timing tests in 1m rows of random utf8 encoded 100 char data. It doesn't look to me like the saving you're suggesting is worth the trouble. Hmm ... not sure I believe your numbers. Using a test file of 1m lines of 100 random latin1 characters converted to utf8 (thus, about half and half 7-bit ASCII and 2-byte utf8 characters), I get this in SQL_ASCII encoding: regression=# \timing Timing is on. regression=# create temp table test(f1 text); CREATE TABLE Time: 5.047 ms regression=# copy test from '/home/tgl/zzz1m'; COPY 100 Time: 4337.089 ms and this in UTF8 encoding: utf8=# \timing Timing is on. utf8=# create temp table test(f1 text); CREATE TABLE Time: 5.108 ms utf8=# copy test from '/home/tgl/zzz1m'; COPY 100 Time: 7776.583 ms The numbers aren't super repeatable, but it sure looks to me like the encoding check adds at least 50% to the runtime in this example; so doing it twice seems unpleasant. (This is CVS HEAD, compiled without assert checking, on an x86_64 Fedora Core 6 box.) Are you comparing apples with apples? The db is utf8 in both of my cases. cheers andrew ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bug in UTF8-Validation Code?
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Here are some timing tests in 1m rows of random utf8 encoded 100 char > data. It doesn't look to me like the saving you're suggesting is worth > the trouble. Hmm ... not sure I believe your numbers. Using a test file of 1m lines of 100 random latin1 characters converted to utf8 (thus, about half and half 7-bit ASCII and 2-byte utf8 characters), I get this in SQL_ASCII encoding: regression=# \timing Timing is on. regression=# create temp table test(f1 text); CREATE TABLE Time: 5.047 ms regression=# copy test from '/home/tgl/zzz1m'; COPY 100 Time: 4337.089 ms and this in UTF8 encoding: utf8=# \timing Timing is on. utf8=# create temp table test(f1 text); CREATE TABLE Time: 5.108 ms utf8=# copy test from '/home/tgl/zzz1m'; COPY 100 Time: 7776.583 ms The numbers aren't super repeatable, but it sure looks to me like the encoding check adds at least 50% to the runtime in this example; so doing it twice seems unpleasant. (This is CVS HEAD, compiled without assert checking, on an x86_64 Fedora Core 6 box.) regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bug in UTF8-Validation Code?
Tom Lane wrote: I wrote: Actually, I have to take back that objection: on closer look, COPY validates the data only once and does so before applying its own backslash-escaping rules. So there is a risk in that path too. It's still pretty annoying to be validating the data twice in the common case where no backslash reduction occurred, but I'm not sure I see any good way to avoid it. Further thought here: if we put encoding verification into textin() and related functions, could we *remove* it from COPY IN, in the common case where client and server encodings are the same? Currently, copy.c forces a trip through pg_client_to_server for multibyte encodings even when the encodings are the same, so as to perform validation. But I'm wondering whether we'd still need that. There's no risk of SQL injection in COPY data. Bogus input encoding could possibly make for confusion about where the field boundaries are, but bad data is bad data in any case. regards, tom lane Here are some timing tests in 1m rows of random utf8 encoded 100 char data. It doesn't look to me like the saving you're suggesting is worth the trouble. baseline: Time: 28228.325 ms Time: 25987.740 ms Time: 25950.707 ms Time: 25756.371 ms Time: 27589.719 ms Time: 25774.417 ms after adding suggested extra test to textin(): Time: 26722.376 ms Time: 28343.226 ms Time: 26529.364 ms Time: 28020.140 ms Time: 24836.853 ms Time: 24860.530 ms Script is: \timing create table xyz (x text); copy xyz from '/tmp/utf8.data'; truncate xyz; copy xyz from '/tmp/utf8.data'; truncate xyz; copy xyz from '/tmp/utf8.data'; truncate xyz; copy xyz from '/tmp/utf8.data'; truncate xyz; copy xyz from '/tmp/utf8.data'; truncate xyz; copy xyz from '/tmp/utf8.data'; drop table xyz; Test platform: FC6, Athlon64. cheers andrew ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Bug in UTF8-Validation Code?
I wrote: > Actually, I have to take back that objection: on closer look, COPY > validates the data only once and does so before applying its own > backslash-escaping rules. So there is a risk in that path too. > It's still pretty annoying to be validating the data twice in the > common case where no backslash reduction occurred, but I'm not sure > I see any good way to avoid it. Further thought here: if we put encoding verification into textin() and related functions, could we *remove* it from COPY IN, in the common case where client and server encodings are the same? Currently, copy.c forces a trip through pg_client_to_server for multibyte encodings even when the encodings are the same, so as to perform validation. But I'm wondering whether we'd still need that. There's no risk of SQL injection in COPY data. Bogus input encoding could possibly make for confusion about where the field boundaries are, but bad data is bad data in any case. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Bug in UTF8-Validation Code?
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> The problem with that is that it duplicates effort: in many cases >> (especially COPY IN) the data's already been validated. > One thought I had was that it might make sense to have a flag that would > inhibit the check, that could be set (and reset) by routines that check > for themselves, such as COPY IN. Then bulk load performance should not > be hit much. Actually, I have to take back that objection: on closer look, COPY validates the data only once and does so before applying its own backslash-escaping rules. So there is a risk in that path too. It's still pretty annoying to be validating the data twice in the common case where no backslash reduction occurred, but I'm not sure I see any good way to avoid it. I don't much want to add another argument to input functions, and the global flag that you suggest above seems too ugly/risky. Would someone do some performance checking on the cost of adding mbverify to textin()? If it could be shown that it adds only negligible overhead to COPY, on say hundred-byte-wide text fields, then we could decide that this isn't worth worrying about. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bug in UTF8-Validation Code?
Tom Lane wrote: Andrew Dunstan <[EMAIL PROTECTED]> writes: Last year Jeff suggested adding something like: pg_verifymbstr(string,strlen(string),0); to each relevant input routine. Would that be an acceptable solution? The problem with that is that it duplicates effort: in many cases (especially COPY IN) the data's already been validated. I'm not sure how to fix that, but I think you'll get some push-back if you double the encoding verification work in COPY for nothing. Given that we are moving away from backslash-enabled literals, I'm not as convinced as some that this must be fixed... They will still be available in E'\nnn' form, won't they? One thought I had was that it might make sense to have a flag that would inhibit the check, that could be set (and reset) by routines that check for themselves, such as COPY IN. Then bulk load performance should not be hit much. cheers andrew ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Bug in UTF8-Validation Code?
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Last year Jeff suggested adding something like: > pg_verifymbstr(string,strlen(string),0); > to each relevant input routine. Would that be an acceptable solution? The problem with that is that it duplicates effort: in many cases (especially COPY IN) the data's already been validated. I'm not sure how to fix that, but I think you'll get some push-back if you double the encoding verification work in COPY for nothing. Given that we are moving away from backslash-enabled literals, I'm not as convinced as some that this must be fixed... regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Bug in UTF8-Validation Code?
Jeff Davis wrote: On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote: On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote: Am Dienstag, 13. März 2007 16:38 schrieb Joshua D. Drake: Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where we had to use iconv? What issues? I've upgraded several 8.0 database to 8.1. without having to use iconv. Did I miss something? http://www.postgresql.org/docs/8.1/interactive/release-8-1.html "Some users are having problems loading UTF-8 data into 8.1.X. This is because previous versions allowed invalid UTF-8 byte sequences to be entered into the database, and this release properly accepts only valid UTF-8 sequences. One way to correct a dumpfile is to run the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql." If the above quote were actually true, then Mario wouldn't be having a problem. Instead, it's half-true: Invalid byte sequences are rejected in some situations and accepted in others. If postgresql consistently rejected or consistently accepted invalid byte sequences, that would not cause problems with COPY (meaning problems with pg_dump, slony, etc.). How can we fix this? Frankly, the statement in the docs warning about making sure that escaped sequences are valid in the server encoding is a cop-out. We don't accept invalid data elsewhere, and this should be no different IMNSHO. I don't see why this should be any different from, say, date or numeric data. For years people have sneered at MySQL because it accepted dates like Feb 31st, and rightly so. But this seems to me to be like our own version of the same problem. Last year Jeff suggested adding something like: pg_verifymbstr(string,strlen(string),0); to each relevant input routine. Would that be an acceptable solution? If not, what would be? cheers andrew ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
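For reference, here is a minimal standalone sketch of the kind of scan a pg_verifymbstr call in an input routine performs. It is deliberately simplified relative to the real function (UTF-8 only, and only sequence-length and continuation-byte checks, no overlong-form detection); returning the offset of the first bad byte lets an error message point at the offending data, as PostgreSQL's invalid-byte-sequence errors do.

```c
#include <assert.h>
#include <stddef.h>

/* Naive per-character UTF-8 scan of the sort an input routine could run
 * on every incoming string.  Returns -1 if the whole string is valid,
 * otherwise the byte offset of the first invalid sequence.
 * Simplified sketch: sequence lengths and continuation bytes only. */
long first_invalid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        int need;
        if (s[i] < 0x80)                 need = 1;  /* ASCII */
        else if ((s[i] & 0xE0) == 0xC0)  need = 2;
        else if ((s[i] & 0xF0) == 0xE0)  need = 3;
        else if ((s[i] & 0xF8) == 0xF0)  need = 4;
        else return (long) i;            /* bare continuation byte or 0xF8+ */
        if (i + (size_t) need > len)
            return (long) i;             /* truncated sequence at end of string */
        for (int j = 1; j < need; j++)
            if ((s[i + j] & 0xC0) != 0x80)
                return (long) i;         /* missing continuation byte */
        i += (size_t) need;
    }
    return -1;
}
```

A string like the thread's '\244' example fails immediately: 0xA4 is a bare continuation byte, so the scan reports offset 0, which is exactly what a check in textin or varcharin would reject at insert time instead of at dump-restore time.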
Re: [HACKERS] Bug in UTF8-Validation Code?
On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote: > On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote: > > Am Dienstag, 13. März 2007 16:38 schrieb Joshua D. Drake: > > > Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where > > > we had to use iconv? > > > > What issues? I've upgraded several 8.0 database to 8.1. without having to > > use > > iconv. Did I miss something? > > http://www.postgresql.org/docs/8.1/interactive/release-8-1.html > > "Some users are having problems loading UTF-8 data into 8.1.X. This > is because previous versions allowed invalid UTF-8 byte sequences > to be entered into the database, and this release properly accepts > only valid UTF-8 sequences. One way to correct a dumpfile is to run > the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql." > If the above quote were actually true, then Mario wouldn't be having a problem. Instead, it's half-true: Invalid byte sequences are rejected in some situations and accepted in others. If postgresql consistently rejected or consistently accepted invalid byte sequences, that would not cause problems with COPY (meaning problems with pg_dump, slony, etc.). Regards, Jeff Davis ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Bug in UTF8-Validation Code?
On Tue, 2007-03-13 at 12:00 +0100, Mario Weilguni wrote: > Hi, > > I've a problem with a database, I can dump the database to a file, but > restoration fails, happens with 8.1.4. I reported the same problem a while back: http://archives.postgresql.org/pgsql-bugs/2006-10/msg00246.php Some people think it's a bug, some people don't. It is technically documented behavior, but I don't think the documentation is clear enough. I think it is a bug that should be fixed, and here's another message in the thread that expresses my opinion: http://archives.postgresql.org/pgsql-bugs/2006-11/msg00033.php If you look at that email, it includes some examples of surprising behaviors caused by that bug, particularly with bytea. In some applications (for which it's impractical to change the source code), I actually use a CHECK constraint (which raises an exception on invalid utf8 data) on every text column so that some binary data doesn't break my slony replication. I'd like to see this fixed. Regards, Jeff Davis ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bug in UTF8-Validation Code?
Albe Laurenz wrote: Mario Weilguni wrote: Is there anything I can do to help with this problem? Maybe implementing a new GUC variable that turns off accepting wrong encoded sequences (so DBAs still can turn it on if they really depend on it)? I think that this should be done away with unconditionally. Or does anybody have a good point for allowing corrupt data in text columns? Maybe it is the way it is now because nobody could be bothered to add the appropriate checks... I agree. It's more or less an integrity violation, IMNSHO. cheers andrew ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bug in UTF8-Validation Code?
Mario Weilguni wrote: > Is there anything I can do to help with this problem? Maybe implementing a new > GUC variable that turns off accepting wrong encoded sequences (so DBAs still > can turn it on if they really depend on it)? I think that this should be done away with unconditionally. Or does anybody have a good point for allowing corrupt data in text columns? Maybe it is the way it is now because nobody could be bothered to add the appropriate checks... Yours, Laurenz Albe ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Bug in UTF8-Validation Code?
On Wednesday, 14 March 2007 08:01, Michael Paesold wrote: > Andrew Dunstan wrote: > > > > This strikes me as essential. If the db has a certain encoding ISTM we > > are promising that all the text data is valid for that encoding. > > > > The question in my mind is how we help people to recover from the fact > > that we haven't done that. > > I would also say that it's a bug that escape sequences can get characters > into the database that are not valid in the specified encoding. If you > compare the encoding to table constraints, there is no way to simply > "escape" a constraint check. > > This seems to violate the principle of consistency in ACID. Additionally, > if you include pg_dump into ACID, it also violates durability, since it > cannot restore what it wrote itself. > Is there anything in the SQL spec that asks for such a behaviour? I guess > not. > > A DBA will usually not even learn about this issue until they are presented > with a failing restore. Is there anything I can do to help with this problem? Maybe implementing a new GUC variable that turns off accepting wrongly encoded sequences (so DBAs still can turn it on if they really depend on it)? For me, Best regards, Mario Weilguni ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Bug in UTF8-Validation Code?
On Wednesday, 14 March 2007 08:01, Michael Paesold wrote: > Is there anything in the SQL spec that asks for such a behaviour? I guess > not. I think that the octal escapes are a holdover from the single-byte days where they were simply a way to enter characters that are difficult to find on a keyboard. In today's multi-encoding world, it would make more sense if there were an escape sequence for a *codepoint* which is then converted to the actual encoding (if possible and valid) in the server. The meaning of codepoint is, however, character set dependent as well. The SQL standard supports escape sequences for Unicode codepoints, which I think would be a very useful feature (try entering a UTF-8 character bytewise ...), but it's a bit weird to implement and it's not clear how to handle character sets other than Unicode. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bug in UTF8-Validation Code?
On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote: > On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote: > > Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where > > we had to use iconv? > > What issues? I've upgraded several 8.0 databases to 8.1 without having to use > iconv. Did I miss something? http://www.postgresql.org/docs/8.1/interactive/release-8-1.html "Some users are having problems loading UTF-8 data into 8.1.X. This is because previous versions allowed invalid UTF-8 byte sequences to be entered into the database, and this release properly accepts only valid UTF-8 sequences. One way to correct a dumpfile is to run the command iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql." -- Michael Fuhr ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Bug in UTF8-Validation Code?
Andrew Dunstan wrote: Albe Laurenz wrote: A fix could be either that the server checks escape sequences for validity This strikes me as essential. If the db has a certain encoding ISTM we are promising that all the text data is valid for that encoding. The question in my mind is how we help people to recover from the fact that we haven't done that. I would also say that it's a bug that escape sequences can get characters into the database that are not valid in the specified encoding. If you compare the encoding to table constraints, there is no way to simply "escape" a constraint check. This seems to violate the principle of consistency in ACID. Additionally, if you include pg_dump into ACID, it also violates durability, since it cannot restore what it wrote itself. Is there anything in the SQL spec that asks for such a behaviour? I guess not. A DBA will usually not even learn about this issue until they are presented with a failing restore. Best Regards, Michael Paesold ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bug in UTF8-Validation Code?
On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote: > Andrew Dunstan wrote: > > Albe Laurenz wrote: > >> A fix could be either that the server checks escape sequences for > >> validity > > > > This strikes me as essential. If the db has a certain encoding ISTM we > > are promising that all the text data is valid for that encoding. > > > > The question in my mind is how we help people to recover from the fact > > that we haven't done that. > > Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where > we had to use iconv? > What issues? I've upgraded several 8.0 databases to 8.1 without having to use iconv. Did I miss something? Regards, Mario Weilguni ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Bug in UTF8-Validation Code?
Andrew Dunstan wrote: > Albe Laurenz wrote: >> A fix could be either that the server checks escape sequences for >> validity >> > > This strikes me as essential. If the db has a certain encoding ISTM we > are promising that all the text data is valid for that encoding. > > The question in my mind is how we help people to recover from the fact > that we haven't done that. Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where we had to use iconv? Joshua D. Drake > > cheers > > andrew > > > > > ---(end of broadcast)--- > TIP 5: don't forget to increase your free space map settings > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Bug in UTF8-Validation Code?
Albe Laurenz wrote: A fix could be either that the server checks escape sequences for validity This strikes me as essential. If the db has a certain encoding ISTM we are promising that all the text data is valid for that encoding. The question in my mind is how we help people to recover from the fact that we haven't done that. cheers andrew ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Bug in UTF8-Validation Code?
Mario Weilguni wrote:
>>> Steps to reproduce:
>>> create database testdb with encoding='UTF8';
>>> \c testdb
>>> create table test(x text);
>>> insert into test values ('\244'); ==> Is accepted, even if not UTF8.
>>
>> This is working as expected, see the remark in
>> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
>>
>> "It is your responsibility that the byte sequences you create
>> are valid characters in the server character set encoding."
>
> In that case, pg_dump is doing wrong here and should quote the output. IMO it
> cannot be defined as working as expected, when this makes any database dumps
> worthless, without any warnings at dump-time.
>
> pg_dump should output \244 itself in that case.

True. Here is a test case on 8.2.3 (OS, database and client all use UTF8):

test=> CREATE TABLE test(x text);
CREATE TABLE
test=> INSERT INTO test VALUES ('correct: ä');
INSERT 0 1
test=> INSERT INTO test VALUES (E'incorrect: \244');
INSERT 0 1
test=> \q

laurenz:~> pg_dump -d -t test -f test.sql

Here is an excerpt from 'od -c test.sql':

0001040   e   n   z  \n   -   -  \n  \n   I   N   S   E   R   T       I
0001060   N   T   O       t   e   s   t       V   A   L   U   E   S
0001100   (   '   c   o   r   r   e   c   t   :       303 244   '   )   ;
0001120  \n   I   N   S   E   R   T       I   N   T   O       t   e   s
0001140   t       V   A   L   U   E   S       (   '   i   n   c   o   r
0001160   r   e   c   t   :       244   '   )   ;  \n  \n  \n   -   -  \n

The invalid character (octal 244) is in the INSERT statement! This makes psql gag:

test=> DROP TABLE test;
DROP TABLE
test=> \i test.sql
SET
SET
SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
INSERT 0 1
psql:test.sql:33: ERROR: invalid byte sequence for encoding "UTF8": 0xa4
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

A fix could be either that the server checks escape sequences for validity, or that pg_dump outputs invalid bytes as escape sequences. Or pg_dump could stop with an error. I think that the cleanest way would be the first. 
Yours, Laurenz Albe ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Bug in UTF8-Validation Code?
On Tuesday, 13 March 2007 at 15:12, Andrew Dunstan wrote:
> The sentence quoted from the docs is perhaps less than a model of
> clarity. I would take it to mean that no client-encoding ->
> server-encoding translation will take place. Does it really mean that
> the server will happily accept any escaped byte sequence, whether or not
> it is valid for the server encoding? If so that seems ... odd.

Yes, \octal sequences are accepted even if invalid. The problem is that
pg_dump will happily dump those sequences as is, so in this case a char
ascii 0xa4 is emitted, and the dump cannot be restored with pg_restore.
A dangerous feature IMO, and it will make a lot of DB admins very unhappy
if they have to check every day whether their precious database dumps can
be restored in case of an error.

Best regards,
Mario Weilguni
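[The server-side check discussed here would validate the bytes produced after octal-escape processing, rather than the raw literal. A rough Python sketch of that idea — the helper names are hypothetical, not PostgreSQL code:]

```python
import re

def unescape_octal(s: str) -> bytes:
    """Expand PostgreSQL-style \\ooo escapes into raw bytes (simplified sketch)."""
    out = bytearray()
    i = 0
    while i < len(s):
        m = re.match(r"\\([0-7]{1,3})", s[i:])
        if m:
            out.append(int(m.group(1), 8) & 0xFF)
            i += len(m.group(0))
        else:
            out.extend(s[i].encode("utf-8"))
            i += 1
    return bytes(out)

def is_valid_utf8(b: bytes) -> bool:
    """The kind of check the server could run after escape processing."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(unescape_octal(r"\244")))      # False: bare 0xa4
print(is_valid_utf8(unescape_octal(r"\303\244")))  # True: 0xc3 0xa4 = 'ä'
```

With a check like this at literal-processing time, the bad byte would be rejected at INSERT rather than at restore.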
Re: [HACKERS] Bug in UTF8-Validation Code?
Mario Weilguni wrote:
> On Tuesday, 13 March 2007 at 14:46, Albe Laurenz wrote:
>> Mario Weilguni wrote:
>>> Steps to reproduce:
>>> create database testdb with encoding='UTF8';
>>> \c testdb
>>> create table test(x text);
>>> insert into test values ('\244'); ==> Is accepted, even if not UTF8.
>>
>> This is working as expected, see the remark in
>> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
>>
>> "It is your responsibility that the byte sequences you create
>> are valid characters in the server character set encoding."
>
> In that case, pg_dump is doing wrong here and should quote the output.
> IMO it cannot be defined as working as expected, when this makes any
> database dumps worthless, without any warnings at dump-time.
>
> pg_dump should output \244 itself in that case.

The sentence quoted from the docs is perhaps less than a model of
clarity. I would take it to mean that no client-encoding ->
server-encoding translation will take place. Does it really mean that
the server will happily accept any escaped byte sequence, whether or not
it is valid for the server encoding? If so that seems ... odd.

cheers

andrew
Re: [HACKERS] Bug in UTF8-Validation Code?
On Tuesday, 13 March 2007 at 14:46, Albe Laurenz wrote:
> Mario Weilguni wrote:
> > Steps to reproduce:
> > create database testdb with encoding='UTF8';
> > \c testdb
> > create table test(x text);
> > insert into test values ('\244'); ==> Is accepted, even if not UTF8.
>
> This is working as expected, see the remark in
> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
>
> "It is your responsibility that the byte sequences you create
> are valid characters in the server character set encoding."

In that case, pg_dump is doing wrong here and should quote the output.
IMO it cannot be defined as working as expected, when this makes any
database dumps worthless, without any warnings at dump-time.

pg_dump should output \244 itself in that case.

Best regards,
Mario Weilguni
Re: [HACKERS] Bug in UTF8-Validation Code?
Mario Weilguni wrote:
> Steps to reproduce:
> create database testdb with encoding='UTF8';
> \c testdb
> create table test(x text);
> insert into test values ('\244'); ==> Is accepted, even if not UTF8.

This is working as expected, see the remark in
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS

"It is your responsibility that the byte sequences you create
are valid characters in the server character set encoding."

Yours,
Laurenz Albe
[HACKERS] Bug in UTF8-Validation Code?
Hi,

I have a problem with a database: I can dump it to a file, but restoration fails. This happens with 8.1.4.

Steps to reproduce:

create database testdb with encoding='UTF8';
\c testdb
create table test(x text);
insert into test values ('\244'); ==> Is accepted, even if not UTF8.
pg_dump testdb -f testdb.dump -Fc
pg_restore -f testdb.dump -d testdb

=> fails with an error:
ERROR: invalid byte sequence for encoding "UTF8": 0xa4

The problem itself comes from a CSV file, which is imported with \copy
without proper quoting (so I have to fix this anyway), but I still think
this is an error, making restoration very complicated in such cases...
Or am I doing something completely wrong here?

Best regards,
Mario Weilguni