[HACKERS] chr() is still too loose about UTF8 code points

2014-05-16 Thread Tom Lane
Quite some time ago, we made the chr() function accept Unicode code points
up to U+1F, which is the largest value that will fit in a 4-byte UTF8
string.  It was pointed out to me though that RFC3629 restricted the
original definition of UTF8 to only allow code points up to U+10 (for
compatibility with UTF16).  While that might not be something we feel we
need to follow exactly, pg_utf8_islegal implements the checking algorithm
specified by RFC3629, and will therefore reject points above U+10.

This means you can use chr() to create values that will be rejected on
dump and reload:

u8=# create table tt (f1 text);
CREATE TABLE
u8=# insert into tt values(chr('x001f'::bit(32)::int));
INSERT 0 1
u8=# select * from tt;
 f1 

 
(1 row)

u8=# \copy tt to 'junk' 
COPY 1
u8=# \copy tt from 'junk'
ERROR:  22021: invalid byte sequence for encoding UTF8: 0xf7 0xbf 0xbf 0xbf
CONTEXT:  COPY tt, line 1
LOCATION:  report_invalid_encoding, wchar.c:2011

I think this probably means we need to change chr() to reject code points
above 10.  Should we back-patch that, or just do it in HEAD?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] chr() is still too loose about UTF8 code points

2014-05-16 Thread Heikki Linnakangas

On 05/16/2014 06:05 PM, Tom Lane wrote:

Quite some time ago, we made the chr() function accept Unicode code points
up to U+1F, which is the largest value that will fit in a 4-byte UTF8
string.  It was pointed out to me though that RFC3629 restricted the
original definition of UTF8 to only allow code points up to U+10 (for
compatibility with UTF16).  While that might not be something we feel we
need to follow exactly, pg_utf8_islegal implements the checking algorithm
specified by RFC3629, and will therefore reject points above U+10.

This means you can use chr() to create values that will be rejected on
dump and reload:

u8=# create table tt (f1 text);
CREATE TABLE
u8=# insert into tt values(chr('x001f'::bit(32)::int));
INSERT 0 1
u8=# select * from tt;
  f1


(1 row)

u8=# \copy tt to 'junk'
COPY 1
u8=# \copy tt from 'junk'
ERROR:  22021: invalid byte sequence for encoding UTF8: 0xf7 0xbf 0xbf 0xbf
CONTEXT:  COPY tt, line 1
LOCATION:  report_invalid_encoding, wchar.c:2011

I think this probably means we need to change chr() to reject code points
above 10.  Should we back-patch that, or just do it in HEAD?


+1 for back-patching. A value that cannot be restored is bad, and I 
can't imagine any legitimate use case for producing a Unicode character 
larger than U+10 with chr(x), when the rest of the system doesn't 
handle it. Fully supporting such values might be useful, but that's a 
different story.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] chr() is still too loose about UTF8 code points

2014-05-16 Thread Andrew Dunstan


On 05/16/2014 12:43 PM, Heikki Linnakangas wrote:

On 05/16/2014 06:05 PM, Tom Lane wrote:
Quite some time ago, we made the chr() function accept Unicode code 
points
up to U+1F, which is the largest value that will fit in a 4-byte 
UTF8

string.  It was pointed out to me though that RFC3629 restricted the
original definition of UTF8 to only allow code points up to U+10 
(for

compatibility with UTF16).  While that might not be something we feel we
need to follow exactly, pg_utf8_islegal implements the checking 
algorithm

specified by RFC3629, and will therefore reject points above U+10.

This means you can use chr() to create values that will be rejected on
dump and reload:

u8=# create table tt (f1 text);
CREATE TABLE
u8=# insert into tt values(chr('x001f'::bit(32)::int));
INSERT 0 1
u8=# select * from tt;
  f1


(1 row)

u8=# \copy tt to 'junk'
COPY 1
u8=# \copy tt from 'junk'
ERROR:  22021: invalid byte sequence for encoding UTF8: 0xf7 0xbf 
0xbf 0xbf

CONTEXT:  COPY tt, line 1
LOCATION:  report_invalid_encoding, wchar.c:2011

I think this probably means we need to change chr() to reject code 
points

above 10.  Should we back-patch that, or just do it in HEAD?


+1 for back-patching. A value that cannot be restored is bad, and I 
can't imagine any legitimate use case for producing a Unicode 
character larger than U+10 with chr(x), when the rest of the 
system doesn't handle it. Fully supporting such values might be 
useful, but that's a different story.





My understanding us that U+10 is the highest legal Unicode code 
point anyway. So this is really just tightening our routines to make 
sure we don't produce an invalid value. We won't be disallowing anything 
that is legal Unicode.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] chr() is still too loose about UTF8 code points

2014-05-16 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 05/16/2014 06:05 PM, Tom Lane wrote:
 I think this probably means we need to change chr() to reject code points
 above 10.  Should we back-patch that, or just do it in HEAD?

 +1 for back-patching. A value that cannot be restored is bad, and I 
 can't imagine any legitimate use case for producing a Unicode character 
 larger than U+10 with chr(x), when the rest of the system doesn't 
 handle it. Fully supporting such values might be useful, but that's a 
 different story.

Well, AFAICT the rest of the system does handle any code point up to
U+1F.  It's only pg_utf8_islegal that's being picky.  So another
possible answer is to weaken the check in pg_utf8_islegal.  However,
that could create interoperability concerns with other software, and
as you say the use-case for larger values seems pretty thin.

Actually, after re-reading the spec there's more to it than this:
chr() will allow creating utf8 sequences that correspond to the
surrogate-pair codes, which are expressly disallowed in UTF8 by
the RFCs.  Maybe we should apply pg_utf8_islegal to the result
string rather than duplicating its checks?

BTW, there are various places that have comments or ifdefd-out code
anticipating possible future support of 5- or 6-byte UTF8 sequences,
which were specified in RFC2279 but then rescinded by RFC3629.
I guess as a matter of cleanup we should think about removing that
stuff.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] chr() is still too loose about UTF8 code points

2014-05-16 Thread Noah Misch
On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:
 I think this probably means we need to change chr() to reject code points
 above 10.  Should we back-patch that, or just do it in HEAD?

The compatibility risks resemble those associated with the fixes for bug
#9210, so I recommend HEAD only:

http://www.postgresql.org/message-id/flat/20140220043940.ga3064...@tornado.leadboat.com

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] chr() is still too loose about UTF8 code points

2014-05-16 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:
 I think this probably means we need to change chr() to reject code points
 above 10.  Should we back-patch that, or just do it in HEAD?

 The compatibility risks resemble those associated with the fixes for bug
 #9210, so I recommend HEAD only:

 http://www.postgresql.org/message-id/flat/20140220043940.ga3064...@tornado.leadboat.com

While I'd be willing to ignore that risk so far as code points above
10 go, if we want pg_utf8_islegal to be happy then we will also
have to reject surrogate-pair code points.  It's not beyond the realm
of possibility that somebody is intentionally generating such code
points with chr(), despite the dump/reload hazard.  So now I agree
that this is sounding more like a major-version-only behavioral change.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] chr() is still too loose about UTF8 code points

2014-05-16 Thread David G Johnston
Tom Lane-2 wrote
 Noah Misch lt;

 noah@

 gt; writes:
 On Fri, May 16, 2014 at 11:05:08AM -0400, Tom Lane wrote:
 I think this probably means we need to change chr() to reject code
 points
 above 10.  Should we back-patch that, or just do it in HEAD?
 
 The compatibility risks resemble those associated with the fixes for bug
 #9210, so I recommend HEAD only:
 
 http://www.postgresql.org/message-id/flat/

 20140220043940.GA3064539@.leadboat

 
 While I'd be willing to ignore that risk so far as code points above
 10 go, if we want pg_utf8_islegal to be happy then we will also
 have to reject surrogate-pair code points.  It's not beyond the realm
 of possibility that somebody is intentionally generating such code
 points with chr(), despite the dump/reload hazard.  So now I agree
 that this is sounding more like a major-version-only behavioral change.

I would tend to agree on principle - though since this does fall in a
grey-area does 9.4 qualify for this bug-fix.

David J.



--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/chr-is-still-too-loose-about-UTF8-code-points-tp5804232p5804270.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] chr() is still too loose about UTF8 code points

2014-05-16 Thread Tom Lane
David G Johnston david.g.johns...@gmail.com writes:
 Tom Lane-2 wrote
 While I'd be willing to ignore that risk so far as code points above
 10 go, if we want pg_utf8_islegal to be happy then we will also
 have to reject surrogate-pair code points.  It's not beyond the realm
 of possibility that somebody is intentionally generating such code
 points with chr(), despite the dump/reload hazard.  So now I agree
 that this is sounding more like a major-version-only behavioral change.

 I would tend to agree on principle - though since this does fall in a
 grey-area does 9.4 qualify for this bug-fix.

I don't think it's too late to change this in 9.4.  The discussion was
about whether to back-patch.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers