Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding

2011-02-19 Thread Bruce Momjian
Marko Kreen wrote:
 On 9/8/10, Tom Lane t...@sss.pgh.pa.us wrote:
  Marko Kreen mark...@gmail.com writes:
   Although it does seem unnecessary.
 
 
  The reason I asked for this to be spelled out is that ordinarily,
   a backslash escape \nnn is a very low-level thing that will insert
   exactly what you say.  To me it's quite unexpected that the system
   would editorialize on that to the extent of replacing two UTF16
   surrogate characters by a single code point.  That's necessary for
   correctness because our underlying storage is UTF8, but it's not
   obvious that it will happen.  (As a counterexample, if our underlying
   storage were UTF16, then very different things would need to happen
   for the exact same SQL input.)
 
   I think a lot of people will have this same question when reading
   this para, which is why I asked for an explanation there.
 
 OK, but I still don't like the two "when"s.  How about:
 
 -6-digit form technically makes this unnecessary.  (When surrogate
 -pairs are used when the server encoding is <literal>UTF8</literal>, they
 -are first combined into a single code point that is then encoded
 -in UTF-8.)
 +6-digit form technically makes this unnecessary.  (Surrogate
 +pairs are not stored directly, but combined into a single
 +code point that is then encoded in UTF-8.)

Applied, thanks.

-- 
  Bruce Momjian  br...@momjian.us    http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding

2010-09-08 Thread Marko Kreen
On 9/7/10, Peter Eisentraut pete...@gmx.net wrote:
 On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
     We combine the surrogate pair components to a single code point and
     encode that in UTF-8.  We don't encode the components separately;
     that would be wrong.
  
   Oh, OK.  Should the docs make that a bit clearer?


 Done.

This is confusing:

 (When surrogate
 pairs are used when the server encoding is <literal>UTF8</literal>, they
 are first combined into a single code point that is then encoded
 in UTF-8.)

So something else happens if encoding is not UTF8?

I think this part can simply be removed; it does not add anything.

Or say that surrogate pairs are only allowed in UTF8 encoding.
The reason is that they cannot encode code points 0..7F, and only
those may be given numerically in other encodings.  But this is
already mentioned earlier in the paragraph.

-- 
marko



Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding

2010-09-08 Thread Peter Eisentraut
On Wed, 2010-09-08 at 10:18 +0300, Marko Kreen wrote:
 On 9/7/10, Peter Eisentraut pete...@gmx.net wrote:
  On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
    We combine the surrogate pair components to a single code point and
    encode that in UTF-8.  We don't encode the components separately;
    that would be wrong.

   Oh, OK.  Should the docs make that a bit clearer?
 
 
  Done.
 
 This is confusing:
 
  (When surrogate
  pairs are used when the server encoding is <literal>UTF8</literal>, they
  are first combined into a single code point that is then encoded
  in UTF-8.)
 
 So something else happens if encoding is not UTF8?

Then you can't specify surrogate pairs because they are outside of the
ASCII range, per the constraint mentioned earlier in the paragraph.
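
A minimal sketch of that constraint, in Python for illustration only.  The
function name and parameters are hypothetical and not taken from the
PostgreSQL source; the rule it encodes is the one stated in this thread,
namely that with a non-UTF8 server encoding only code points up to 7F may
be given numerically.

```python
def check_unicode_escape(code_point: int, server_encoding: str) -> int:
    """Hypothetical check for a numerically specified code point.

    With a server encoding other than UTF8, only ASCII (<= 0x7F) is accepted.
    Surrogates (0xD800..0xDFFF) are far above that limit, so they can never
    be written in such a database.
    """
    if server_encoding.upper() != "UTF8" and code_point > 0x7F:
        raise ValueError(
            "code point U+%04X not allowed when server encoding is %s"
            % (code_point, server_encoding)
        )
    return code_point


# ASCII is accepted in any encoding; a surrogate is only accepted under UTF8
# (where it still has to be paired up afterwards).
check_unicode_escape(0x41, "LATIN1")
check_unicode_escape(0xD834, "UTF8")
```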

 I think this part can be simply removed, it does not add anything.
 
 Or say that surrogate pairs are only allowed in UTF8 encoding.
 Reason is that you cannot encode 0..7F codepoints with them,
 and only those are allowed to be given numerically.  But this is
 already mentioned before.

Well, Tom wanted an additional explanation.  I personally agree with
you; this is not the place to explain encoding and Unicode internals,
when really the code only does what it's supposed to.




Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding

2010-09-08 Thread Marko Kreen
On 9/8/10, Peter Eisentraut pete...@gmx.net wrote:
 On Wed, 2010-09-08 at 10:18 +0300, Marko Kreen wrote:
   On 9/7/10, Peter Eisentraut pete...@gmx.net wrote:
     On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
         We combine the surrogate pair components to a single code point and
         encode that in UTF-8.  We don't encode the components separately;
         that would be wrong.

       Oh, OK.  Should the docs make that a bit clearer?

     Done.

   This is confusing:

     (When surrogate
     pairs are used when the server encoding is <literal>UTF8</literal>, they
     are first combined into a single code point that is then encoded
     in UTF-8.)

   So something else happens if encoding is not UTF8?


 Then you can't specify surrogate pairs because they are outside of the
  ASCII range, per constraint mentioned earlier in the paragraph.


   I think this part can be simply removed, it does not add anything.
  
   Or say that surrogate pairs are only allowed in UTF8 encoding.
   Reason is that you cannot encode 0..7F codepoints with them,
   and only those are allowed to be given numerically.  But this is
   already mentioned before.


 Well, Tom wanted an additional explanation.  I personally agree with
  you; this is not the place to explain encoding and Unicode internals,
  when really the code only does what it's supposed to.

Ah OK, I had the impression you had changed the wording before that too,
so this addition seemed unnecessary.  But it seems you only changed the
formatting.

Anyway, the double "when" makes it read oddly.  Maybe a more concise version:

  To repeat, surrogate pairs are combined into a single character and then
  encoded, not stored separately.

Although it does seem unnecessary.

-- 
marko



Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding

2010-09-08 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 Although it does seem unnecessary.

The reason I asked for this to be spelled out is that ordinarily,
a backslash escape \nnn is a very low-level thing that will insert
exactly what you say.  To me it's quite unexpected that the system
would editorialize on that to the extent of replacing two UTF16
surrogate characters by a single code point.  That's necessary for
correctness because our underlying storage is UTF8, but it's not
obvious that it will happen.  (As a counterexample, if our underlying
storage were UTF16, then very different things would need to happen
for the exact same SQL input.)
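
To make the counterexample concrete, here is a small Python sketch.  U+1D11E
is just an arbitrary example of a code point outside the BMP, and Python's
codecs stand in for the storage layer; nothing here is PostgreSQL code.

```python
# The code point that the surrogate pair D834 DD1E stands for.
g_clef = "\U0001D11E"

# With UTF-8 storage the pair has to be merged first; the stored bytes bear
# no visible resemblance to the two 4-digit escapes that were typed.
utf8_bytes = g_clef.encode("utf-8")
print(utf8_bytes.hex())    # f09d849e

# With (hypothetical) UTF-16 storage the two surrogates would be stored
# exactly as written, one 16-bit unit each.
utf16_bytes = g_clef.encode("utf-16-be")
print(utf16_bytes.hex())   # d834dd1e
```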

I think a lot of people will have this same question when reading
this para, which is why I asked for an explanation there.

regards, tom lane



Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding

2010-09-08 Thread Marko Kreen
On 9/8/10, Tom Lane t...@sss.pgh.pa.us wrote:
 Marko Kreen mark...@gmail.com writes:
   Although it does seem unnecessary.


 The reason I asked for this to be spelled out is that ordinarily,
  a backslash escape \nnn is a very low-level thing that will insert
  exactly what you say.  To me it's quite unexpected that the system
  would editorialize on that to the extent of replacing two UTF16
  surrogate characters by a single code point.  That's necessary for
  correctness because our underlying storage is UTF8, but it's not
  obvious that it will happen.  (As a counterexample, if our underlying
  storage were UTF16, then very different things would need to happen
  for the exact same SQL input.)

  I think a lot of people will have this same question when reading
  this para, which is why I asked for an explanation there.

OK, but I still don't like the two "when"s.  How about:

-6-digit form technically makes this unnecessary.  (When surrogate
-pairs are used when the server encoding is <literal>UTF8</literal>, they
-are first combined into a single code point that is then encoded
-in UTF-8.)
+6-digit form technically makes this unnecessary.  (Surrogate
+pairs are not stored directly, but combined into a single
+code point that is then encoded in UTF-8.)
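
For readers who want to see why the 6-digit form makes surrogate pairs
unnecessary, here is a rough Python sketch of a decoder for the two escape
forms, \XXXX and \+XXXXXX.  It is an illustration only: the function name is
made up, escaped backslashes are not handled, and it is not how the
PostgreSQL scanner is written.

```python
import re

# \+XXXXXX (6 hex digits) or \XXXX (4 hex digits).
_ESCAPE = re.compile(r"\\(?:\+([0-9A-Fa-f]{6})|([0-9A-Fa-f]{4}))")


def decode_unicode_escapes(body: str) -> str:
    """Hypothetical decoder: expand escapes, merging surrogate pairs."""
    out = []
    pending_high = None   # a high surrogate waiting for its partner
    pos = 0
    for m in _ESCAPE.finditer(body):
        literal = body[pos:m.start()]
        if literal and pending_high is not None:
            raise ValueError("high surrogate not followed by low surrogate")
        out.append(literal)
        pos = m.end()
        cp = int(m.group(1) or m.group(2), 16)
        if pending_high is not None:
            if not 0xDC00 <= cp <= 0xDFFF:
                raise ValueError("high surrogate not followed by low surrogate")
            # Merge the pair into one code point above the BMP.
            cp = 0x10000 + ((pending_high - 0xD800) << 10) + (cp - 0xDC00)
            pending_high = None
        elif 0xD800 <= cp <= 0xDBFF:
            pending_high = cp
            continue
        elif 0xDC00 <= cp <= 0xDFFF:
            raise ValueError("low surrogate without preceding high surrogate")
        out.append(chr(cp))
    if pending_high is not None or (body[pos:] and pending_high is not None):
        raise ValueError("high surrogate not followed by low surrogate")
    out.append(body[pos:])
    return "".join(out)


# Both spellings yield the same single code point, hence the same UTF-8 bytes.
via_pair = decode_unicode_escapes(r"\D834\DD1E")
via_six_digits = decode_unicode_escapes(r"\+01D11E")
print(via_pair.encode("utf-8").hex())         # f09d849e
print(via_six_digits.encode("utf-8").hex())   # f09d849e
```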

-- 
marko



Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding

2010-09-07 Thread Peter Eisentraut
On Sun, 2010-08-22 at 15:15 -0400, Tom Lane wrote:
  We combine the surrogate pair components to a single code point and
  encode that in UTF-8.  We don't encode the components separately;
  that would be wrong.
 
 Oh, OK.  Should the docs make that a bit clearer?

Done.




Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding

2010-08-23 Thread Florian Weimer
* Tom Lane:

 I just noticed that we are now advertising the ability to insert UTF16
 surrogate pairs in strings and identifiers (see section 4.1.2.2 in
 current docs, in particular).  Is this really wise?  I thought that
 surrogate pairs were specifically prohibited in UTF8 strings, because
 of the security hazards implicit in having more than one way to
 represent the same code point.

There is relatively little risk because surrogate pairs cannot encode
characters in the BMP, and presumably, most of the critical characters
are located there.

However, if this is converted to regular UTF-8, I really question the
sense of this.  Usually, people want CESU-8 to preserve ordering
between languages such as C# and Java and their database, and
conversion destroys this property.
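
The ordering point can be shown with a short Python sketch.  The cesu8()
helper is written for this example (Python has no built-in CESU-8 codec),
and the two sample characters are arbitrary: one near the top of the BMP,
one outside it.  Comparing big-endian UTF-16 bytes stands in for a binary
comparison of UTF-16 code units.

```python
def cesu8(text: str) -> bytes:
    """Illustrative CESU-8 encoder: encode each UTF-16 code unit separately."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a surrogate pair, as UTF-16 would.
            cp -= 0x10000
            units = (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))
        else:
            units = (cp,)
        for unit in units:
            # "surrogatepass" lets Python emit the 3-byte form of a surrogate.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bmp = "\uFF5E"          # inside the BMP, above the surrogate range
astral = "\U0001D11E"   # outside the BMP
samples = [bmp, astral]

utf16_order = sorted(samples, key=lambda s: s.encode("utf-16-be"))
cesu8_order = sorted(samples, key=cesu8)
utf8_order = sorted(samples, key=lambda s: s.encode("utf-8"))

print(utf16_order == cesu8_order)   # True: CESU-8 keeps the UTF-16 order
print(utf16_order == utf8_order)    # False: regular UTF-8 does not
```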

-- 
Florian Weimer    fwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99



Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding

2010-08-23 Thread Marko Kreen
On 8/22/10, Peter Eisentraut pete...@gmx.net wrote:
 On Sun, 2010-08-22 at 14:29 -0400, Tom Lane wrote:
   I just noticed that we are now advertising the ability to insert UTF16
   surrogate pairs in strings and identifiers (see section 4.1.2.2 in
   current docs, in particular).  Is this really wise?  I thought that
   surrogate pairs were specifically prohibited in UTF8 strings, because
   of the security hazards implicit in having more than one way to
   represent the same code point.


 We combine the surrogate pair components to a single code point and
  encode that in UTF-8.  We don't encode the components separately; that
  would be wrong.

AFAICS our UTF8 validator (pg_utf8_islegal) detects and rejects
separately encoded surrogates, whatever means are used to insert them,
e.g. \x escapes.

Although it's not very obvious...
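
As an analogy only (this is not pg_utf8_islegal, and the helper names are
invented), the same rule can be seen in Python: a strict UTF-8 decoder
refuses the 3-byte encodings of U+D800..U+DFFF, which always start with
byte ED followed by a byte in the range A0..BF.

```python
def has_encoded_surrogate(data: bytes) -> bool:
    """True if data contains ED A0..BF, the start of an encoded surrogate."""
    for i in range(len(data) - 1):
        if data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF:
            return True
    return False


def is_strict_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


separately_encoded = bytes.fromhex("eda0b4edb49e")   # D834 and DD1E, each on its own
properly_encoded = bytes.fromhex("f09d849e")         # U+1D11E as one code point

print(has_encoded_surrogate(separately_encoded), is_strict_utf8(separately_encoded))
# True False
print(has_encoded_surrogate(properly_encoded), is_strict_utf8(properly_encoded))
# False True
```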

-- 
marko



Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding

2010-08-22 Thread Peter Eisentraut
On Sun, 2010-08-22 at 14:29 -0400, Tom Lane wrote:
 I just noticed that we are now advertising the ability to insert UTF16
 surrogate pairs in strings and identifiers (see section 4.1.2.2 in
 current docs, in particular).  Is this really wise?  I thought that
 surrogate pairs were specifically prohibited in UTF8 strings, because
 of the security hazards implicit in having more than one way to
 represent the same code point.

We combine the surrogate pair components to a single code point and
encode that in UTF-8.  We don't encode the components separately; that
would be wrong.
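
A minimal Python sketch of the difference, assuming the standard UTF-16
surrogate arithmetic; the function name is made up for this example and the
code is not taken from the PostgreSQL source.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Merge a UTF-16 surrogate pair into the code point it represents."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %04X" % low)
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD834, 0xDD1E)
print(hex(code_point))    # 0x1d11e

# What is done: encode the single combined code point (4 bytes).
combined = chr(code_point).encode("utf-8")
print(combined.hex())     # f09d849e

# What is not done: encode each component on its own (2 x 3 bytes).
# Python only produces this with the "surrogatepass" error handler,
# because the result is not valid UTF-8.
separate = b"".join(
    chr(unit).encode("utf-8", "surrogatepass") for unit in (0xD834, 0xDD1E)
)
print(separate.hex())     # eda0b4edb49e
```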




Re: [HACKERS] UTF16 surrogate pairs in UTF8 encoding

2010-08-22 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 On Sun, 2010-08-22 at 14:29 -0400, Tom Lane wrote:
 I just noticed that we are now advertising the ability to insert UTF16
 surrogate pairs in strings and identifiers (see section 4.1.2.2 in
 current docs, in particular).  Is this really wise?  I thought that
 surrogate pairs were specifically prohibited in UTF8 strings, because
 of the security hazards implicit in having more than one way to
 represent the same code point.

 We combine the surrogate pair components to a single code point and
 encode that in UTF-8.  We don't encode the components separately; that
 would be wrong.

Oh, OK.  Should the docs make that a bit clearer?

regards, tom lane
