Re: [HACKERS] is there a deep unyielding reason to limit U&'' literals to ASCII?

2016-01-25 Thread Tom Lane
Robert Haas  writes:
> On Sat, Jan 23, 2016 at 11:27 PM, Chapman Flack  wrote:
>> What I would have expected would be to allow escapes
>> for any Unicode codepoint that's representable in the server encoding,
>> whatever encoding that is.

> I don't know anything for sure here, but I wonder if it would make
> validating string literals in non-UTF8 encodings significantly more
> costly.

I think it would, and it would likely also require function calls to
loadable functions (at least given the current design whereby encoding
conversions are farmed out to loadable libraries).  I do not especially
want the lexer doing that; it will open all sorts of fun questions
involving what we can lex in an already-failed transaction.

It may well be that these issues are surmountable with some sweat,
but it doesn't sound like an easy patch to me.  And how big is the
use-case, really?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] is there a deep unyielding reason to limit U&'' literals to ASCII?

2016-01-25 Thread Robert Haas
On Sat, Jan 23, 2016 at 11:27 PM, Chapman Flack  wrote:
> I see in the documentation (and confirm in practice) that a
> Unicode character string literal U&'...' is only allowed to have
> escapes representing Unicode characters if the
> server encoding is, exactly and only, UTF8.
>
> Otherwise, it can still have escapes, but they can only
> be in the range \+01 to \+7f and can only represent ASCII characters
> ... and this isn't just for an ASCII server encoding but for _any server
> encoding other than UTF8_.
>
> I'm a newcomer here, so maybe there was an existing long conversation
> where that was determined to be necessary for some deep reason, and I
> just need to be pointed to it.
>
> What I would have expected would be to allow escapes
> for any Unicode codepoint that's representable in the server encoding,
> whatever encoding that is. Indeed, that's how I read the SQL standard
> (or my scrounged 2006 draft of it, anyway). The standard even lets
> you precede U& with _charsetname and have the escapes be allowed to
> be any character representable in the specified charset. *That*, I assume,
> would be tough to implement in PostgreSQL, since strings don't walk
> around with their own personal charsets attached. But what's the reason
> for not being able to mention characters available in the server encoding?

I don't know anything for sure here, but I wonder if it would make
validating string literals in non-UTF8 encodings significantly more
costly.  When the encoding is UTF-8, the test as to whether the escape
sequence forms a legal code point doesn't require any table lookups.
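
To illustrate the point about UTF-8 needing no table lookups, here is a
minimal sketch in Python (not PostgreSQL's actual lexer code, which is C).
The function names are hypothetical. The legality test is a pair of range
comparisons, and the encoding step is shifts and masks only. Rejecting code
point zero is an assumption made here to mirror the fact that a string
constant cannot contain a zero byte.

```python
def is_legal_code_point(cp: int) -> bool:
    """Range checks only: nonzero, inside Unicode, and not a surrogate."""
    if cp <= 0 or cp > 0x10FFFF:
        return False
    # U+D800..U+DFFF are surrogate halves, not characters in their own right.
    return not (0xD800 <= cp <= 0xDFFF)


def encode_utf8(cp: int) -> bytes:
    """Encode one code point as UTF-8 using arithmetic, with no lookup table."""
    if not is_legal_code_point(cp):
        raise ValueError(f"invalid Unicode code point: {cp:#x}")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For any other server encoding there is no such closed-form mapping, so the
equivalent check would have to consult a conversion table for that encoding.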

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] is there a deep unyielding reason to limit U&'' literals to ASCII?

2016-01-23 Thread Chapman Flack
I see in the documentation (and confirm in practice) that a
Unicode character string literal U&'...' is only allowed to have
escapes representing Unicode characters if the
server encoding is, exactly and only, UTF8.

Otherwise, it can still have escapes, but they can only
be in the range \+01 to \+7f and can only represent ASCII characters
... and this isn't just for an ASCII server encoding but for _any server
encoding other than UTF8_.

I'm a newcomer here, so maybe there was an existing long conversation
where that was determined to be necessary for some deep reason, and I
just need to be pointed to it.

What I would have expected would be to allow escapes
for any Unicode codepoint that's representable in the server encoding,
whatever encoding that is. Indeed, that's how I read the SQL standard
(or my scrounged 2006 draft of it, anyway). The standard even lets
you precede U& with _charsetname and have the escapes be allowed to
be any character representable in the specified charset. *That*, I assume,
would be tough to implement in PostgreSQL, since strings don't walk
around with their own personal charsets attached. But what's the reason
for not being able to mention characters available in the server encoding?
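
The difference between the two rules can be sketched roughly as follows, in
Python. This is only an illustration under stated assumptions: Python's
codecs stand in for PostgreSQL's encoding conversion routines, the encoding
names are Python's rather than PostgreSQL's, and both function names are
hypothetical.

```python
import codecs


def allowed_today(cp: int, server_encoding: str) -> bool:
    """The documented rule: full Unicode only under UTF8, else ASCII only."""
    if codecs.lookup(server_encoding).name == "utf-8":
        return 0 < cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)
    return 0x01 <= cp <= 0x7F


def allowed_if_representable(cp: int, server_encoding: str) -> bool:
    """The expected rule: any code point the server encoding can represent."""
    if not (0 < cp <= 0x10FFFF):
        return False
    try:
        # Stand-in for a conversion into the server encoding.
        chr(cp).encode(server_encoding)
    except UnicodeEncodeError:
        return False
    return True
```

Under these assumptions, U+00E9 with a Latin-1 server encoding is rejected by
the first rule but accepted by the second, while U+20AC is rejected by both
for Latin-1 and accepted by the second for ISO 8859-15.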

-Chap

