Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding
Marko Kreen wrote:
> On 9/8/10, Tom Lane t...@sss.pgh.pa.us wrote:
> > Marko Kreen mark...@gmail.com writes:
> > > Although it does seem unnecessary.
> >
> > The reason I asked for this to be spelled out is that ordinarily, a backslash escape \nnn is a very low-level thing that will insert exactly what you say. To me it's quite unexpected that the system would editorialize on that to the extent of replacing two UTF16 surrogate characters by a single code point. That's necessary for correctness because our underlying storage is UTF8, but it's not obvious that it will happen. (As a counterexample, if our underlying storage were UTF16, then very different things would need to happen for the exact same SQL input.)
> >
> > I think a lot of people will have this same question when reading this para, which is why I asked for an explanation there.
>
> Ok, but I still don't like the whens. How about:
>
> -6-digit form technically makes this unnecessary. (When surrogate
> -pairs are used when the server encoding is <literal>UTF8</>, they
> -are first combined into a single code point that is then encoded
> -in UTF-8.)
> +6-digit form technically makes this unnecessary. (Surrogate
> +pairs are not stored directly, but combined into a single
> +code point that is then encoded in UTF-8.)

Applied, thanks.

-- 
Bruce Momjian br...@momjian.us http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding
On 9/7/10, Peter Eisentraut pete...@gmx.net wrote:
> On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
> > > We combine the surrogate pair components to a single code point and encode that in UTF-8. We don't encode the components separately; that would be wrong.
> >
> > Oh, OK. Should the docs make that a bit clearer?
>
> Done.

This is confusing:

(When surrogate pairs are used when the server encoding is <literal>UTF8</>, they are first combined into a single code point that is then encoded in UTF-8.)

So something else happens if encoding is not UTF8?

I think this part can be simply removed, it does not add anything.

Or say that surrogate pairs are only allowed in UTF8 encoding. Reason is that you cannot encode 0..7F codepoints with them, and only those are allowed to be given numerically. But this is already mentioned before.

-- 
marko
Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding
On Wed, 2010-09-08 at 10:18 +0300, Marko Kreen wrote:
> On 9/7/10, Peter Eisentraut pete...@gmx.net wrote:
> > On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
> > > > We combine the surrogate pair components to a single code point and encode that in UTF-8. We don't encode the components separately; that would be wrong.
> > >
> > > Oh, OK. Should the docs make that a bit clearer?
> >
> > Done.
>
> This is confusing:
>
> (When surrogate pairs are used when the server encoding is <literal>UTF8</>, they are first combined into a single code point that is then encoded in UTF-8.)
>
> So something else happens if encoding is not UTF8?

Then you can't specify surrogate pairs because they are outside of the ASCII range, per constraint mentioned earlier in the paragraph.

> I think this part can be simply removed, it does not add anything.
>
> Or say that surrogate pairs are only allowed in UTF8 encoding. Reason is that you cannot encode 0..7F codepoints with them, and only those are allowed to be given numerically. But this is already mentioned before.

Well, Tom wanted an additional explanation. I personally agree with you; this is not the place to explain encoding and Unicode internals, when really the code only does what it's supposed to.
Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding
On 9/8/10, Peter Eisentraut pete...@gmx.net wrote:
> On Wed, 2010-09-08 at 10:18 +0300, Marko Kreen wrote:
> > On 9/7/10, Peter Eisentraut pete...@gmx.net wrote:
> > > On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
> > > > > We combine the surrogate pair components to a single code point and encode that in UTF-8. We don't encode the components separately; that would be wrong.
> > > >
> > > > Oh, OK. Should the docs make that a bit clearer?
> > >
> > > Done.
> >
> > This is confusing:
> >
> > (When surrogate pairs are used when the server encoding is <literal>UTF8</>, they are first combined into a single code point that is then encoded in UTF-8.)
> >
> > So something else happens if encoding is not UTF8?
>
> Then you can't specify surrogate pairs because they are outside of the ASCII range, per constraint mentioned earlier in the paragraph.
>
> > I think this part can be simply removed, it does not add anything.
> >
> > Or say that surrogate pairs are only allowed in UTF8 encoding. Reason is that you cannot encode 0..7F codepoints with them, and only those are allowed to be given numerically. But this is already mentioned before.
>
> Well, Tom wanted an additional explanation. I personally agree with you; this is not the place to explain encoding and Unicode internals, when really the code only does what it's supposed to.

Ah OK, I had the impression you changed wording before that too, so then this addition seemed unnecessary. But seems you only changed formatting.

Anyway, this when makes it weird. Maybe more concise version:

To repeat, surrogate pairs are combined to single character and then encoded, not stored separately.

Although it does seem unnecessary.

-- 
marko
Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding
Marko Kreen mark...@gmail.com writes:
> Although it does seem unnecessary.

The reason I asked for this to be spelled out is that ordinarily, a backslash escape \nnn is a very low-level thing that will insert exactly what you say. To me it's quite unexpected that the system would editorialize on that to the extent of replacing two UTF16 surrogate characters by a single code point. That's necessary for correctness because our underlying storage is UTF8, but it's not obvious that it will happen. (As a counterexample, if our underlying storage were UTF16, then very different things would need to happen for the exact same SQL input.)

I think a lot of people will have this same question when reading this para, which is why I asked for an explanation there.

regards, tom lane
Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding
On 9/8/10, Tom Lane t...@sss.pgh.pa.us wrote:
> Marko Kreen mark...@gmail.com writes:
> > Although it does seem unnecessary.
>
> The reason I asked for this to be spelled out is that ordinarily, a backslash escape \nnn is a very low-level thing that will insert exactly what you say. To me it's quite unexpected that the system would editorialize on that to the extent of replacing two UTF16 surrogate characters by a single code point. That's necessary for correctness because our underlying storage is UTF8, but it's not obvious that it will happen. (As a counterexample, if our underlying storage were UTF16, then very different things would need to happen for the exact same SQL input.)
>
> I think a lot of people will have this same question when reading this para, which is why I asked for an explanation there.

Ok, but I still don't like the whens. How about:

-6-digit form technically makes this unnecessary. (When surrogate
-pairs are used when the server encoding is <literal>UTF8</>, they
-are first combined into a single code point that is then encoded
-in UTF-8.)
+6-digit form technically makes this unnecessary. (Surrogate
+pairs are not stored directly, but combined into a single
+code point that is then encoded in UTF-8.)

-- 
marko
Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding
On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
> > We combine the surrogate pair components to a single code point and encode that in UTF-8. We don't encode the components separately; that would be wrong.
>
> Oh, OK. Should the docs make that a bit clearer?

Done.
Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding
* Tom Lane:

> I just noticed that we are now advertising the ability to insert UTF16 surrogate pairs in strings and identifiers (see section 4.1.2.2 in current docs, in particular). Is this really wise? I thought that surrogate pairs were specifically prohibited in UTF8 strings, because of the security hazards implicit in having more than one way to represent the same code point.

There is relatively little risk because surrogate pairs cannot encode characters in the BMP, and presumably, most of the critical characters are located there.

However, if this is converted to regular UTF-8, I really question the sense of this. Usually, people want CESU-8 to preserve ordering between languages such as C# and Java and their database, and conversion destroys this property.

-- 
Florian Weimer fwei...@bfk.de
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
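The ordering concern raised in this message can be illustrated with a small sketch. It assumes the usual definition of CESU-8 (each UTF-16 surrogate of a supplementary character is encoded as its own 3-byte sequence); `cesu8_encode` is a hypothetical helper written for this illustration, not a PostgreSQL or standard-library function.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper for illustration).

    BMP characters are encoded as ordinary UTF-8. Supplementary characters
    are first split into a UTF-16 surrogate pair, and each surrogate is then
    encoded as a separate 3-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for unit in (high, low):
                # "surrogatepass" lets Python emit the 3-byte form of a surrogate.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


bmp_char = "\uff5e"            # a BMP character above the surrogate range
supplementary = "\U00010000"   # first code point outside the BMP

# UTF-16 code-unit order (what C# and Java string comparison sees):
# the supplementary character sorts first, because 0xD800 < 0xFF5E.
print(supplementary.encode("utf-16-be") < bmp_char.encode("utf-16-be"))  # True

# CESU-8 byte order agrees with the UTF-16 order.
print(cesu8_encode(supplementary) < cesu8_encode(bmp_char))              # True

# Regular UTF-8 byte order follows code point order, so it disagrees.
print(supplementary.encode("utf-8") < bmp_char.encode("utf-8"))          # False
```

The two encodings only disagree when a supplementary character is compared with a BMP character in the range U+E000 to U+FFFF; everywhere else the byte orders coincide.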
Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding
On 8/22/10, Peter Eisentraut pete...@gmx.net wrote:
> On Sun, 2010-08-22 at 14:29 -0400, Tom Lane wrote:
> > I just noticed that we are now advertising the ability to insert UTF16 surrogate pairs in strings and identifiers (see section 4.1.2.2 in current docs, in particular). Is this really wise? I thought that surrogate pairs were specifically prohibited in UTF8 strings, because of the security hazards implicit in having more than one way to represent the same code point.
>
> We combine the surrogate pair components to a single code point and encode that in UTF-8. We don't encode the components separately; that would be wrong.

AFAICS our UTF8 validator (pg_utf8_islegal) detects and rejects such sequences, if they are inserted via any means, eg. \x

Although it's not very obvious...

-- 
marko
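As a rough illustration of what "rejects such sequences" means, the sketch below uses Python's strict UTF-8 decoder as a stand-in validator. It is not the code of pg_utf8_islegal; `is_legal_utf8` is a hypothetical name, and the only claim is that a strict UTF-8 check treats separately encoded surrogates as invalid.

```python
def is_legal_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (stand-in validator).

    Python's strict decoder rejects byte sequences that would decode to
    surrogate code points (U+D800..U+DFFF), as the UTF-8 definition requires.
    """
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# U+1F600 encoded properly: one 4-byte sequence.
proper = b"\xf0\x9f\x98\x80"

# The same character with each surrogate (D83D, DE00) encoded separately
# as a 3-byte sequence. This is what must not end up in a UTF-8 string.
separately_encoded = b"\xed\xa0\xbd\xed\xb8\x80"

print(is_legal_utf8(proper))              # True
print(is_legal_utf8(separately_encoded))  # False
```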
Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding
On Sun, 2010-08-22 at 14:29 -0400, Tom Lane wrote:
> I just noticed that we are now advertising the ability to insert UTF16 surrogate pairs in strings and identifiers (see section 4.1.2.2 in current docs, in particular). Is this really wise? I thought that surrogate pairs were specifically prohibited in UTF8 strings, because of the security hazards implicit in having more than one way to represent the same code point.

We combine the surrogate pair components to a single code point and encode that in UTF-8. We don't encode the components separately; that would be wrong.
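The combining step described in this message can be sketched as follows. This is a minimal illustration of the standard UTF-16 surrogate arithmetic in Python, not PostgreSQL's actual code; the function names `combine_surrogates` and `encode_pair_as_utf8` are made up for the example.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first value is not a high (leading) surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second value is not a low (trailing) surrogate")
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encode_pair_as_utf8(high: int, low: int) -> bytes:
    """Encode the combined code point, not the two surrogates, as UTF-8."""
    return chr(combine_surrogates(high, low)).encode("utf-8")


code_point = combine_surrogates(0xD83D, 0xDE00)
print(f"U+{code_point:04X}")                      # U+1F600
print(encode_pair_as_utf8(0xD83D, 0xDE00).hex())  # f09f9880
```

The result is one 4-byte UTF-8 sequence, so a character written as a surrogate pair and the same character written in the 6-digit form end up with identical bytes.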
Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding
Peter Eisentraut pete...@gmx.net writes:
> On Sun, 2010-08-22 at 14:29 -0400, Tom Lane wrote:
> > I just noticed that we are now advertising the ability to insert UTF16 surrogate pairs in strings and identifiers (see section 4.1.2.2 in current docs, in particular). Is this really wise? I thought that surrogate pairs were specifically prohibited in UTF8 strings, because of the security hazards implicit in having more than one way to represent the same code point.
>
> We combine the surrogate pair components to a single code point and encode that in UTF-8. We don't encode the components separately; that would be wrong.

Oh, OK. Should the docs make that a bit clearer?

regards, tom lane