Re: [HACKERS] Bug in UTF8-Validation Code?

2007-06-13 Thread Andrew Dunstan
What is the state of play with this item? I think this is a must-fix bug for 8.3. There was a flurry of messages back in April but since then I don't recall seeing anything. cheers andrew Mark Dilger wrote: Mark Dilger wrote: Bruce Momjian wrote: Added to TODO: * Fix cases where

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-06 Thread Albe Laurenz
Martijn van Oosterhout wrote: So your implementation is simply: 1. Take number and make UTF-8 string 2. Convert it to database encoding. Aah, now I can spot where the misunderstanding is. That's not what I mean. I mean that chr() should simply 'typecast' to char. So when the database encoding

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-05 Thread Albe Laurenz
Tatsuo Ishii wrote: I think we need to continue design discussion, probably targeting 8.4, not 8.3. But isn't a simple fix for chr() and ascii(), which does not require a redesign, a Good Thing for 8.3 if possible? Something that maintains as much upward and/or Oracle compatibility as

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-05 Thread Martijn van Oosterhout
On Thu, Apr 05, 2007 at 09:34:25AM +0900, Tatsuo Ishii wrote: I'm not sure what kind of use case for unicode_char() you are thinking about. Anyway if you want a code point from a character, we could easily add such functions to all backend encodings currently we support. Probably it would look

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-05 Thread Martijn van Oosterhout
On Thu, Apr 05, 2007 at 11:52:14AM +0200, Albe Laurenz wrote: But isn't a simple fix for chr() and ascii(), which does not require a redesign, a Good Thing for 8.3 if possible? Something that maintains as much upward and/or Oracle compatibility as possible while doing away with ascii('EUR')

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-05 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes: I think the problem is that most encodings do not have the concept of a code point anyway, so implementing it for them is fairly useless. Yeah. I'm beginning to think that the right thing to do is (a) make chr/ascii do the same thing as Oracle

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Martijn van Oosterhout
On Tue, Apr 03, 2007 at 01:06:38PM -0400, Tom Lane wrote: I think it's probably defensible for non-Unicode encodings. To do otherwise would require (a) figuring out what the equivalent concept to code point is for each encoding, and (b) having a separate code path for each encoding to perform

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Albe Laurenz
Mark Dilger wrote: What I suggest (and what Oracle implements, and isn't CHR() and ASCII() partly for Oracle compatibility?) is that CHR() and ASCII() convert between a character (in database encoding) and that database encoding in numeric form. Looking at Oracle documentation, it appears

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Zeugswetter Andreas ADI SD
What do others think? Should the argument to CHR() be a Unicode code point or the numeric representation of the database encoding? When the database uses a single byte encoding, the chr function takes the binary byte representation as an integer number between 0 and 255 (e.g. ascii code).

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Albe Laurenz
When the database uses a single byte encoding, the chr function takes the binary byte representation as an integer number between 0 and 255 (e.g. ascii code). When the database encoding is one of the unicode encodings it takes a unicode code point. This is also what Oracle does. Sorry, but

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Zeugswetter Andreas ADI SD
When the database uses a single byte encoding, the chr function takes the binary byte representation as an integer number between 0 and 255 (e.g. ascii code). When the database encoding is one of the unicode encodings it takes a unicode code point. This is also what Oracle does.
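
The rule proposed above can be sketched in a few lines. This is only an illustration under stated assumptions, not the PostgreSQL implementation: `chr_sketch` is a hypothetical name, Python codec names stand in for database encodings, and multibyte encodings other than UTF-8 are simply not handled.

```python
def chr_sketch(value: int, encoding: str) -> bytes:
    """Hypothetical chr(): code point for UTF-8, raw byte value otherwise."""
    if encoding.lower() in ("utf-8", "utf8"):
        # Unicode database: the argument is a code point.
        # chr() rejects values above 0x10FFFF, encode() rejects surrogates.
        return chr(value).encode("utf-8")
    # Single-byte database: the argument is the byte itself (0..255).
    if not 0 <= value <= 255:
        raise ValueError("single-byte encodings only have values 0..255")
    raw = bytes([value])
    raw.decode(encoding)  # raises if this byte is unassigned in the encoding
    return raw


print(chr_sketch(0x20AC, "utf-8").hex())      # e282ac
print(chr_sketch(0xA4, "iso-8859-15").hex())  # a4
```

Both calls produce the Euro sign in their respective encodings, which is why the same character needs different arguments under this rule.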

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Alvaro Herrera
Martijn van Oosterhout wrote: On Tue, Apr 03, 2007 at 01:06:38PM -0400, Tom Lane wrote: I think it's probably defensible for non-Unicode encodings. To do otherwise would require (a) figuring out what the equivalent concept to code point is for each encoding, and (b) having a separate code

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Andrew - Supernews
On 2007-04-04, Alvaro Herrera [EMAIL PROTECTED] wrote: Right -- IMHO what we should be doing is reject any input to chr() which is beyond plain ASCII (or maybe 255), and create a separate function (unicode_char() sounds good) to get a Unicode character from a code point, converted to the

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes: Right -- IMHO what we should be doing is reject any input to chr() which is beyond plain ASCII (or maybe 255), and create a separate function (unicode_char() sounds good) to get a Unicode character from a code point, converted to the local

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Tom Lane
Andrew - Supernews [EMAIL PROTECTED] writes: Thinking about this made me realize that there's another, ahem, elephant in the room here: convert(). By definition convert() returns text strings which are not valid in the server encoding. How can this be addressed? Remove convert(). Or at least

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Martijn van Oosterhout
On Wed, Apr 04, 2007 at 10:22:28AM -0400, Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Right -- IMHO what we should be doing is reject any input to chr() which is beyond plain ASCII (or maybe 255), and create a separate function (unicode_char() sounds good) to get a Unicode

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Tatsuo Ishii
Alvaro Herrera [EMAIL PROTECTED] writes: Right -- IMHO what we should be doing is reject any input to chr() which is beyond plain ASCII (or maybe 255), and create a separate function (unicode_char() sounds good) to get a Unicode character from a code point, converted to the local

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Alvaro Herrera
Tatsuo Ishii wrote: BTW, every encoding has its own charset. However the relationship between encoding and charset is not so simple as Unicode. For example, encoding EUC_JP corresponds to multiple charsets, namely ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which returns a

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Peter Eisentraut
On Wednesday, 4 April 2007 16:22, Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Right -- IMHO what we should be doing is reject any input to chr() which is beyond plain ASCII (or maybe 255), and create a separate function (unicode_char() sounds good) to get a Unicode character

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Mark Dilger
Albe Laurenz wrote: There's one thing that strikes me as weird in your implementation: pgsql=# select chr(0); ERROR: character 0x00 of encoding SQL_ASCII has no equivalent in UTF8 0x00 is a valid UNICODE code point and also a valid UTF-8 character! It's not my code that rejects this. I'm

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Mark Dilger
Tatsuo Ishii wrote: SNIP. I think we need to continue design discussion, probably targeting 8.4, not 8.3. The discussion came about because Andrew - Supernews noticed that chr() returns invalid utf8, and we're trying to fix all the bugs with invalid utf8 in the system. Something

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Tom Lane
Mark Dilger [EMAIL PROTECTED] writes: Albe Laurenz wrote: 0x00 is a valid UNICODE code point and also a valid UTF-8 character! It's not my code that rejects this. I'm passing the resultant string to the pg_verify_mbstr(...) function and it is rejecting a null. I could change that, of

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Tatsuo Ishii
Tatsuo Ishii wrote: SNIP. I think we need to continue design discussion, probably targeting 8.4, not 8.3. The discussion came about because Andrew - Supernews noticed that chr() returns invalid utf8, and we're trying to fix all the bugs with invalid utf8 in the system.

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Tatsuo Ishii
Tatsuo Ishii wrote: BTW, every encoding has its own charset. However the relationship between encoding and charset is not so simple as Unicode. For example, encoding EUC_JP corresponds to multiple charsets, namely ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Tatsuo Ishii
Andrew - Supernews [EMAIL PROTECTED] writes: Thinking about this made me realize that there's another, ahem, elephant in the room here: convert(). By definition convert() returns text strings which are not valid in the server encoding. How can this be addressed? Remove convert(). Or

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-04 Thread Andrew - Supernews
On 2007-04-05, Tatsuo Ishii [EMAIL PROTECTED] wrote: Andrew - Supernews [EMAIL PROTECTED] writes: Thinking about this made me realize that there's another, ahem, elephant in the room here: convert(). By definition convert() returns text strings which are not valid in the server encoding.

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-03 Thread Albe Laurenz
Mark Dilger wrote: In particular, in UTF8 land I'd have expected the argument of chr() to be interpreted as a Unicode code point, not as actual UTF8 bytes with a randomly-chosen endianness. Not sure what to do in other multibyte encodings. Not sure what to do in other multibyte encodings

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-03 Thread Andrew - Supernews
On 2007-04-03, Albe Laurenz [EMAIL PROTECTED] wrote: According to RFC 2279, the Euro, Unicode code point 0x20AC = 0010 0000 1010 1100, will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC. IMHO this is the only good and intuitive way for CHR() and ASCII(). It is beyond ludicrous for
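
The bit arithmetic quoted above can be checked with a short sketch. It assumes only the three-byte UTF-8 form (code points U+0800 to U+FFFF); `utf8_encode_3byte` is a hypothetical helper written for illustration, and it does not reject surrogates.

```python
def utf8_encode_3byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("outside the three-byte range")
    return bytes([
        0b11100000 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0b10000000 | ((code_point >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0b10000000 | (code_point & 0x3F),         # 10xxxxxx: low 6 bits
    ])


euro = utf8_encode_3byte(0x20AC)
print(euro.hex())                     # e282ac
print(euro == "€".encode("utf-8"))    # True
```

The code point 0x20AC and the integer 0xE282AC are therefore two different numbers for the same character, which is the ambiguity the thread is arguing about.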

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-03 Thread Martijn van Oosterhout
On Tue, Apr 03, 2007 at 11:43:21AM +0200, Albe Laurenz wrote: IMHO this is the only good and intuitive way for CHR() and ASCII(). Hardly. The comment earlier about mbtowc was much closer to the mark. And wide characters are defined as Unicode points. Basically, CHR() takes a unicode point and

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-03 Thread Albe Laurenz
Andrew wrote: According to RFC 2279, the Euro, Unicode code point 0x20AC = 0010 0000 1010 1100, will be encoded to 1110 0010 1000 0010 1010 1100 = 0xE282AC. IMHO this is the only good and intuitive way for CHR() and ASCII(). It is beyond ludicrous for functions like chr() or ascii() to

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-03 Thread Mark Dilger
Martijn van Oosterhout wrote: On Tue, Apr 03, 2007 at 11:43:21AM +0200, Albe Laurenz wrote: IMHO this is the only good and intuitive way for CHR() and ASCII(). Hardly. The comment earlier about mbtowc was much closer to the mark. And wide characters are defined as Unicode points. Basically,

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-03 Thread Tom Lane
Mark Dilger [EMAIL PROTECTED] writes: Martijn van Oosterhout wrote: Just about every multibyte encoding other than Unicode has the problem of not distinguishing between the code point and the encoding of it. Thanks for the feedback. Would you say that the way I implemented things in the

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-03 Thread Mark Dilger
Albe Laurenz wrote: What I suggest (and what Oracle implements, and isn't CHR() and ASCII() partly for Oracle compatibility?) is that CHR() and ASCII() convert between a character (in database encoding) and that database encoding in numeric form. Looking at Oracle documentation, it appears

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-02 Thread Mark Dilger
Andrew - Supernews wrote: On 2007-04-01, Mark Dilger [EMAIL PROTECTED] wrote: Do any of the string functions (see http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the risk of generating invalid utf8 encoded strings? Do I need to add checks? Are there known bugs with

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-02 Thread Mark Dilger
Mark Dilger wrote: Andrew - Supernews wrote: On 2007-04-01, Mark Dilger [EMAIL PROTECTED] wrote: Do any of the string functions (see http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the risk of generating invalid utf8 encoded strings? Do I need to add checks? Are

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-02 Thread Tom Lane
Mark Dilger [EMAIL PROTECTED] writes: pgsql=# select chr(14989485); chr - 中 (1 row) Is there a principled rationale for this particular behavior as opposed to any other? In particular, in UTF8 land I'd have expected the argument of chr() to be interpreted as a Unicode code point, not
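
To see why chr(14989485) yields 中 in the example quoted above, it helps to write the argument out as bytes. A minimal sketch follows; `integer_as_utf8` is a hypothetical helper name, and big-endian byte order is an assumption matching the quoted result.

```python
def integer_as_utf8(value: int) -> str:
    """Treat the integer's big-endian bytes as a UTF-8 sequence."""
    length = (value.bit_length() + 7) // 8 or 1
    return value.to_bytes(length, "big").decode("utf-8")


n = 14989485
print(hex(n))                         # 0xe4b8ad
print(integer_as_utf8(n))             # 中
print(hex(ord(integer_as_utf8(n))))   # 0x4e2d
```

So the argument is the UTF-8 byte sequence E4 B8 AD read as one number, while the Unicode code point of the same character is U+4E2D.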

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-02 Thread Mark Dilger
Tom Lane wrote: Mark Dilger [EMAIL PROTECTED] writes: pgsql=# select chr(14989485); chr - 中 (1 row) Is there a principled rationale for this particular behavior as opposed to any other? In particular, in UTF8 land I'd have expected the argument of chr() to be interpreted as a Unicode

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-02 Thread Mark Dilger
Mark Dilger wrote: Tom Lane wrote: Mark Dilger [EMAIL PROTECTED] writes: pgsql=# select chr(14989485); chr - 中 (1 row) Is there a principled rationale for this particular behavior as opposed to any other? In particular, in UTF8 land I'd have expected the argument of chr() to be

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-02 Thread Mark Dilger
Mark Dilger wrote: Tom Lane wrote: Mark Dilger [EMAIL PROTECTED] writes: pgsql=# select chr(14989485); chr - 中 (1 row) Is there a principled rationale for this particular behavior as opposed to any other? In particular, in UTF8 land I'd have expected the argument of chr() to be

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-02 Thread Mark Dilger
Mark Dilger wrote: Since chr() is defined in oracle_compat.c, I decided to look at what Oracle might do. See http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96540/functions18a.htm It looks to me like they are doing the same thing that I did, though I don't have Oracle

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-02 Thread Andrew - Supernews
On 2007-04-02, Mark Dilger [EMAIL PROTECTED] wrote: Here's the code for the new chr() function: if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c()) Clearly wrong - this allows returning invalid UTF8 data in locale C, which is not an uncommon setting to use. Treating the parameter

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-01 Thread Martijn van Oosterhout
On Sat, Mar 31, 2007 at 07:47:21PM -0700, Mark Dilger wrote: OK, I can take a stab at fixing this. I'd like to state some assumptions so people can comment and reply: I assume that I need to fix *all* cases where invalid byte encodings get into the database through functions shipped in

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-01 Thread Andrew - Supernews
On 2007-04-01, Mark Dilger [EMAIL PROTECTED] wrote: Do any of the string functions (see http://www.postgresql.org/docs/8.2/interactive/functions-string.html) run the risk of generating invalid utf8 encoded strings? Do I need to add checks? Are there known bugs with these functions in this

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-01 Thread Mark Dilger
Martijn van Oosterhout wrote: There's also the performance angle. The current mbverify is very inefficient for encodings like UTF-8. You might need to refactor a bit there... There appears to be a lot of function call overhead in the current implementation. In pg_verify_mbstr, the function

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-01 Thread Tom Lane
Mark Dilger [EMAIL PROTECTED] writes: Refactoring the way these table driven functions work would impact lots of other code. Just grep for all files #including mb/pg_wchar.h for the list of them. The list includes interfaces/libpq, and I'm wondering if software that links against postgres

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-01 Thread Tatsuo Ishii
Mark Dilger [EMAIL PROTECTED] writes: Refactoring the way these table driven functions work would impact lots of other code. Just grep for all files #including mb/pg_wchar.h for the list of them. The list includes interfaces/libpq, and I'm wondering if software that links against

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-04-01 Thread Tom Lane
Tatsuo Ishii [EMAIL PROTECTED] writes: No, we've never exported those with the intent that client code should use 'em. I thought PQescapeString() of 8.3 uses mbverify functions to make sure that user supplied multibyte string is valid. Certainly --- but we can change PQescapeString to match

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-31 Thread Mark Dilger
Bruce Momjian wrote: Added to TODO: * Fix cases where invalid byte encodings are accepted by the database, but throw an error on SELECT http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php Is anyone working on fixing this bug? Hi, has anyone

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-31 Thread Mark Dilger
Mark Dilger wrote: Bruce Momjian wrote: Added to TODO: * Fix cases where invalid byte encodings are accepted by the database, but throw an error on SELECT http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php Is anyone working on fixing this bug? Hi, has

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-22 Thread Bruce Momjian
Added to TODO: * Fix cases where invalid byte encodings are accepted by the database, but throw an error on SELECT http://archives.postgresql.org/pgsql-hackers/2007-03/msg00767.php Is anyone working on fixing this bug?

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-19 Thread Mario Weilguni
On Sunday, 18 March 2007 12:36, Martijn van Oosterhout wrote: It seems to me that the easiest solution would be to forbid \x?? escape sequences where it's greater than \x7F for UTF-8 server encodings. Instead introduce a \u escape for specifying the unicode character directly. Under the

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-18 Thread Grzegorz Jaskiewicz
evil mode1 Maybe we should add a resource intensive check to ascii encoding(s), that would even the score ;p /evil mode1 evil mode 2 let's test mysql on this, and see how much worse it performs. /evil mode 2 -- Grzegorz 'the evil' Jaskiewicz evil C/evil C++ developer for hire

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-18 Thread Martijn van Oosterhout
On Sat, Mar 17, 2007 at 11:46:01AM -0400, Andrew Dunstan wrote: How can we fix this? Frankly, the statement in the docs warning about making sure that escaped sequences are valid in the server encoding is a cop-out. We don't accept invalid data elsewhere, and this should be no different

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-18 Thread Andrew Dunstan
Martijn van Oosterhout wrote: On Sat, Mar 17, 2007 at 11:46:01AM -0400, Andrew Dunstan wrote: How can we fix this? Frankly, the statement in the docs warning about making sure that escaped sequences are valid in the server encoding is a cop-out. We don't accept invalid data elsewhere, and

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-18 Thread Martijn van Oosterhout
On Sun, Mar 18, 2007 at 08:25:56AM -0400, Andrew Dunstan wrote: It does also seem from my test results that transcoding to MB charsets (or at least to utf-8) is surprisingly expensive, and that this would be a good place to look at optimisation possibilities. The validity tests can also be

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-18 Thread Andrew Dunstan
I wrote: The escape processing is actually done in the lexer in the case of literals. We have to allow for bytea literals there too, regardless of encoding. The lexer naturally has no notion of the intended destination of the literal, so we need to defer the validity check to the *in

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-18 Thread Gregory Stark
Andrew Dunstan [EMAIL PROTECTED] writes: Below is a list of the input routines in the adt directory, courtesy of grep. Grep isn't a good way to get these, your list missed a bunch. postgres=# select distinct prosrc from pg_proc where oid in (select typinput from pg_type); prosrc

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-18 Thread Andrew Dunstan
Gregory Stark wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Below is a list of the input routines in the adt directory, courtesy of grep. Grep isn't a good way to get these, your list missed a bunch. postgres=# select distinct prosrc from pg_proc where oid in (select typinput

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-18 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: Ok, good point. Now, which of those need to have a check for valid encoding? The vast majority will barf on any non-ASCII character anyway ... only the ones that don't will need a check. regards, tom lane

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-18 Thread ITAGAKI Takahiro
Jeff Davis [EMAIL PROTECTED] wrote: Some people think it's a bug, some people don't. It is technically documented behavior, but I don't think the documentation is clear enough. I think it is a bug that should be fixed, and here's another message in the thread that expresses my opinion:

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-17 Thread Andrew Dunstan
Jeff Davis wrote: On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote: On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote: On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote: Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where we had

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-17 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: Last year Jeff suggested adding something like: pg_verifymbstr(string,strlen(string),0); to each relevant input routine. Would that be an acceptable solution? The problem with that is that it duplicates effort: in many cases (especially COPY IN) the

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-17 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Last year Jeff suggested adding something like: pg_verifymbstr(string,strlen(string),0); to each relevant input routine. Would that be an acceptable solution? The problem with that is that it duplicates effort: in many

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-17 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane wrote: The problem with that is that it duplicates effort: in many cases (especially COPY IN) the data's already been validated. One thought I had was that it might make sense to have a flag that would inhibit the check, that could be set

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-17 Thread Tom Lane
I wrote: Actually, I have to take back that objection: on closer look, COPY validates the data only once and does so before applying its own backslash-escaping rules. So there is a risk in that path too. It's still pretty annoying to be validating the data twice in the common case where no

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-17 Thread Andrew Dunstan
Tom Lane wrote: I wrote: Actually, I have to take back that objection: on closer look, COPY validates the data only once and does so before applying its own backslash-escaping rules. So there is a risk in that path too. It's still pretty annoying to be validating the data twice

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-17 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes: Here are some timing tests in 1m rows of random utf8 encoded 100 char data. It doesn't look to me like the saving you're suggesting is worth the trouble. Hmm ... not sure I believe your numbers. Using a test file of 1m lines of 100 random latin1

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-17 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Here are some timing tests in 1m rows of random utf8 encoded 100 char data. It doesn't look to me like the saving you're suggesting is worth the trouble. Hmm ... not sure I believe your numbers. Using a test file of 1m lines

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-17 Thread Andrew Dunstan
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: Here are some timing tests in 1m rows of random utf8 encoded 100 char data. It doesn't look to me like the saving you're suggesting is worth the trouble. Hmm ... not sure I believe your numbers. Using a test file of 1m lines

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-16 Thread Mario Weilguni
On Wednesday, 14 March 2007 08:01, Michael Paesold wrote: Andrew Dunstan wrote: This strikes me as essential. If the db has a certain encoding ISTM we are promising that all the text data is valid for that encoding. The question in my mind is how we help people to recover from the fact

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-16 Thread Albe Laurenz
Mario Weilguni wrote: Is there anything I can do to help with this problem? Maybe implementing a new GUC variable that turns off accepting wrong encoded sequences (so DBAs still can turn it on if they really depend on it)? I think that this should be done away with unconditionally. Or does

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-16 Thread Andrew Dunstan
Albe Laurenz wrote: Mario Weilguni wrote: Is there anything I can do to help with this problem? Maybe implementing a new GUC variable that turns off accepting wrong encoded sequences (so DBAs still can turn it on if they really depend on it)? I think that this

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-16 Thread Jeff Davis
On Tue, 2007-03-13 at 12:00 +0100, Mario Weilguni wrote: Hi, I've a problem with a database, I can dump the database to a file, but restoration fails, happens with 8.1.4. I reported the same problem a while back: http://archives.postgresql.org/pgsql-bugs/2006-10/msg00246.php Some people

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-16 Thread Jeff Davis
On Wed, 2007-03-14 at 01:29 -0600, Michael Fuhr wrote: On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote: On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote: Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where we had to use iconv? What issues?

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-14 Thread Michael Paesold
Andrew Dunstan wrote: Albe Laurenz wrote: A fix could be either that the server checks escape sequences for validity This strikes me as essential. If the db has a certain encoding ISTM we are promising that all the text data is valid for that encoding. The question in my mind is how we

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-14 Thread Michael Fuhr
On Tue, Mar 13, 2007 at 04:42:35PM +0100, Mario Weilguni wrote: On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote: Is this any different than the issues of moving 8.0.x to 8.1 UTF8? Where we had to use iconv? What issues? I've upgraded several 8.0 databases to 8.1 without having to

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-14 Thread Peter Eisentraut
On Wednesday, 14 March 2007 08:01, Michael Paesold wrote: Is there anything in the SQL spec that asks for such a behaviour? I guess not. I think that the octal escapes are a holdover from the single-byte days where they were simply a way to enter characters that are difficult to find on a

[HACKERS] Bug in UTF8-Validation Code?

2007-03-13 Thread Mario Weilguni
Hi, I've a problem with a database, I can dump the database to a file, but restoration fails, happens with 8.1.4. Steps to reproduce: create database testdb with encoding='UTF8'; \c testdb create table test(x text); insert into test values ('\244'); == Is accepted, even if not UTF8. pg_dump
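
The octal escape '\244' in the steps above denotes the single byte 0xA4, which cannot stand on its own in UTF-8. A minimal sketch of that check, using Python's strict UTF-8 decoder as a stand-in for a server-side validator (`is_valid_utf8` is a hypothetical helper, not a PostgreSQL function):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


lone_byte = bytes([0o244])            # octal 244 == 0xA4
print(lone_byte.hex())                # a4
print(is_valid_utf8(lone_byte))       # False: a continuation byte with no lead byte
print(is_valid_utf8("€".encode("utf-8")))  # True
```

A value stored this way is exactly the kind of data that a later validating step rejects.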

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-13 Thread Albe Laurenz
Mario Weilguni wrote: Steps to reproduce: create database testdb with encoding='UTF8'; \c testdb create table test(x text); insert into test values ('\244'); == Is accepted, even if not UTF8. This is working as expected, see the remark in

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-13 Thread Mario Weilguni
On Tuesday, 13 March 2007 14:46, Albe Laurenz wrote: Mario Weilguni wrote: Steps to reproduce: create database testdb with encoding='UTF8'; \c testdb create table test(x text); insert into test values ('\244'); == Is accepted, even if not UTF8. This is working as expected, see the

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-13 Thread Andrew Dunstan
Mario Weilguni wrote: On Tuesday, 13 March 2007 14:46, Albe Laurenz wrote: Mario Weilguni wrote: Steps to reproduce: create database testdb with encoding='UTF8'; \c testdb create table test(x text); insert into test values ('\244'); == Is accepted, even if not UTF8. This is

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-13 Thread Mario Weilguni
On Tuesday, 13 March 2007 15:12, Andrew Dunstan wrote: The sentence quoted from the docs is perhaps less than a model of clarity. I would take it to mean that no client-encoding - server-encoding translation will take place. Does it really mean that the server will happily accept any escaped

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-13 Thread Albe Laurenz
Mario Weilguni wrote: Steps to reproduce: create database testdb with encoding='UTF8'; \c testdb create table test(x text); insert into test values ('\244'); == Is accepted, even if not UTF8. This is working as expected, see the remark in

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-13 Thread Andrew Dunstan
Albe Laurenz wrote: A fix could be either that the server checks escape sequences for validity This strikes me as essential. If the db has a certain encoding ISTM we are promising that all the text data is valid for that encoding. The question in my mind is how we help people to recover

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-13 Thread Joshua D. Drake
Andrew Dunstan wrote: Albe Laurenz wrote: A fix could be either that the server checks escape sequences for validity This strikes me as essential. If the db has a certain encoding ISTM we are promising that all the text data is valid for that encoding. The question in my mind is how

Re: [HACKERS] Bug in UTF8-Validation Code?

2007-03-13 Thread Mario Weilguni
On Tuesday, 13 March 2007 16:38, Joshua D. Drake wrote: Andrew Dunstan wrote: Albe Laurenz wrote: A fix could be either that the server checks escape sequences for validity This strikes me as essential. If the db has a certain encoding ISTM we are promising that all the text data is