Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-25 Thread Andrew Dunstan



Marko Kreen wrote:

On 9/25/09, to...@tuxteam.de  wrote:
  

 On Thu, Sep 24, 2009 at 09:42:32PM +0300, Peter Eisentraut wrote:
 > Good idea.  This could also check for other invalid things like
 > byte-order marks in UTF-8.

But watch out. Microsoft apps do like to insert a BOM at the beginning
 of the text. Not that I think it's a good idea, but the Unicode folks
 seem to think its OK [1] :-(



As BOM  does not actively break transport layers, it's less clear-cut
whether to reject it.  It could be said that BOM at the start of string
is OK.  BOM at the middle of string is more rejectable.  But it will
only confuse some high-level character counters, not low-level encoders.

  


It seems pretty clear from the URL that Tomas posted that we should not 
treat a BOM specially at all, and just treat it as another Unicode char.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-25 Thread Marko Kreen
On 9/25/09, to...@tuxteam.de  wrote:
>  On Thu, Sep 24, 2009 at 09:42:32PM +0300, Peter Eisentraut wrote:
>  > Good idea.  This could also check for other invalid things like
>  > byte-order marks in UTF-8.
>
> But watch out. Microsoft apps do like to insert a BOM at the beginning
>  of the text. Not that I think it's a good idea, but the Unicode folks
>  seem to think its OK [1] :-(

As BOM  does not actively break transport layers, it's less clear-cut
whether to reject it.  It could be said that BOM at the start of string
is OK.  BOM at the middle of string is more rejectable.  But it will
only confuse some high-level character counters, not low-level encoders.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-24 Thread tomas
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Thu, Sep 24, 2009 at 09:42:32PM +0300, Peter Eisentraut wrote:
> On Wed, 2009-09-23 at 22:46 +0300, Marko Kreen wrote:

[...]

> Good idea.  This could also check for other invalid things like
> byte-order marks in UTF-8.

But watch out. Microsoft apps do like to insert a BOM at the beginning
of the text. Not that I think it's a good idea, but the Unicode folks
seem to think its OK [1] :-(

  

Regards
- -- tomás
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFKvFtCBcgs9XrR2kYRAsHXAJ9lpaqZ2IFKGwZd+H3Ej6H+m44vpgCeLe7n
vc+ciE1N5AqOre3DmvwKaNI=
=UTBQ
-END PGP SIGNATURE-

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-24 Thread Peter Eisentraut
On Wed, 2009-09-23 at 22:46 +0300, Marko Kreen wrote:
> I looked at your code for U& and saw that you allow standalone
> second half of the surrogate pair there, although you error
> out on first half.  Was that deliberate?

No.

> Perhaps pg_verifymbstr() should be made to check for such values,
> because even if we fix the escaping code, such data can still be
> inserted via plain utf8 or \x escapes?

Good idea.  This could also check for other invalid things like
byte-order marks in UTF-8.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-23 Thread Marko Kreen
On 9/23/09, Peter Eisentraut  wrote:
> On Wed, 2009-09-09 at 18:26 +0300, Marko Kreen wrote:
> > Unicode escapes for extended strings.
>
> Committed.

Thank you for handling the patch.


I looked at your code for U& and saw that you allow standalone
second half of the surrogate pair there, although you error
out on first half.  Was that deliberate?

Standalone surrogate halfs cause headaches for anything that wants to
handle data in UTF16.  The area 0xD800-0xDFFF is explicitly reserved
for UTF16 encoding and does not contain any valid Unicode codepoints.

Perhaps pg_verifymbstr() should be made to check for such values,
because even if we fix the escaping code, such data can still be
inserted via plain utf8 or \x escapes?

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-22 Thread Peter Eisentraut
On Wed, 2009-09-09 at 18:26 +0300, Marko Kreen wrote:
> Unicode escapes for extended strings.
> 
> On 4/16/09, Marko Kreen  wrote:
> > Reasons:
> >
> >  - More people are familiar with \u escaping, as it's standard
> >   in Java/C#/Python, probably more..
> >  - U& strings will not work when stdstr=off.
> >
> >  Syntax:
> >
> >   \u  - 16-bit value
> >   \U  - 32-bit value
> >
> >  Additionally, both \u and \U can be used to specify UTF-16 surrogate
> >  pairs to encode characters with value > 0x.  This is exact behaviour
> >  used by Java/C#/Python.  (except that Java does not have \U)
> 
> v3 of the patch:
> 
> - convert to new reentrant lexer API
> - add lexer targets to avoid fallback to default
> - completely disallow \U\u without proper number of hex values
> - fix logic bug in surrogate pair handling

Committed.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-21 Thread Peter Eisentraut
On Wed, 2009-09-09 at 18:26 +0300, Marko Kreen wrote:
> Unicode escapes for extended strings.
> 
> On 4/16/09, Marko Kreen  wrote:
> > Reasons:
> >
> >  - More people are familiar with \u escaping, as it's standard
> >   in Java/C#/Python, probably more..
> >  - U& strings will not work when stdstr=off.
> >
> >  Syntax:
> >
> >   \u  - 16-bit value
> >   \U  - 32-bit value
> >
> >  Additionally, both \u and \U can be used to specify UTF-16 surrogate
> >  pairs to encode characters with value > 0x.  This is exact behaviour
> >  used by Java/C#/Python.  (except that Java does not have \U)
> 
> v3 of the patch:
> 
> - convert to new reentrant lexer API
> - add lexer targets to avoid fallback to default
> - completely disallow \U\u without proper number of hex values
> - fix logic bug in surrogate pair handling

This looks good to me.  I'm implementing the surrogate pair handling for
the U& syntax for consistency.  Then I'll apply this.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-09-09 Thread Marko Kreen
Unicode escapes for extended strings.

On 4/16/09, Marko Kreen  wrote:
> Reasons:
>
>  - More people are familiar with \u escaping, as it's standard
>   in Java/C#/Python, probably more..
>  - U& strings will not work when stdstr=off.
>
>  Syntax:
>
>   \u  - 16-bit value
>   \U  - 32-bit value
>
>  Additionally, both \u and \U can be used to specify UTF-16 surrogate
>  pairs to encode characters with value > 0x.  This is exact behaviour
>  used by Java/C#/Python.  (except that Java does not have \U)

v3 of the patch:

- convert to new reentrant lexer API
- add lexer targets to avoid fallback to default
- completely disallow \U\u without proper number of hex values
- fix logic bug in surrogate pair handling

-- 
marko
diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index 7637eab..b6f26cc 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -394,6 +394,14 @@ SELECT 'foo'  'bar';
 
 hexadecimal byte value

+   
+
+ \u,
+ \U
+ (x = 0 - 9, A - F)
+
+16 or 32-bit hexadecimal Unicode character value.
+   
   
   
  
@@ -407,6 +415,14 @@ SELECT 'foo'  'bar';
 
 
 
+	 The Unicode escape syntax works fully only when the server encoding is UTF8.
+	 When other server encodings are used, only code points in the ASCII range
+	 (up to \u007F) can be specified.  Both \u and \U
+	 can also be used to specify UTF-16 surrogate pair to escape characters
+	 with value larger than \u.
+	
+
+
  It is your responsibility that the byte sequences you create are
  valid characters in the server character set encoding.  When the
  server encoding is UTF-8, then the alternative Unicode escape
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index f404f9d..8ca3007 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -97,6 +97,8 @@ static void check_escape_warning(base_yyscan_t yyscanner);
 extern int	base_yyget_column(yyscan_t yyscanner);
 extern void base_yyset_column(int column_no, yyscan_t yyscanner);
 
+static void addunicode(pg_wchar c, yyscan_t yyscanner);
+
 %}
 
 %option reentrant
@@ -134,6 +136,7 @@ extern void base_yyset_column(int column_no, yyscan_t yyscanner);
  *   $foo$ quoted strings
  *   quoted identifier with Unicode escapes
  *   quoted string with Unicode escapes
+ *   Unicode surrogate escape in extended string
  */
 
 %x xb
@@ -145,6 +148,7 @@ extern void base_yyset_column(int column_no, yyscan_t yyscanner);
 %x xdolq
 %x xui
 %x xus
+%x xeu
 
 /*
  * In order to make the world safe for Windows and Mac clients as well as
@@ -223,6 +227,8 @@ xeinside		[^\\']+
 xeescape		[\\][^0-7]
 xeoctesc		[\\][0-7]{1,3}
 xehexesc		[\\]x[0-9A-Fa-f]{1,2}
+xeunicode		[\\](u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})
+xeunicodebad	[\\]([uU])
 
 /* Extended quote
  * xqdouble implements embedded quote, 
@@ -535,6 +541,51 @@ other			.
 {xeinside}  {
 	addlit(yytext, yyleng, yyscanner);
 }
+{xeunicode} {
+	pg_wchar c = strtoul(yytext+2, NULL, 16);
+
+	check_escape_warning(yyscanner);
+
+	/*
+	 * handle UTF-16 surrogates:
+	 *   [0xD800..0xDC00) - first elem.
+	 *   [0xDC00..0xE000) - second elem.
+	 */
+	if (c >= 0xD800 && c < 0xE000)
+	{
+		if (c >= 0xDC00)
+			yyerror("invalid Unicode surrogate pair");
+
+		yyextra->utf16_top_part = ((c & 0x3FF) << 10) + 0x1;
+		BEGIN(xeu);
+	}
+	else
+		addunicode(c, yyscanner);
+}
+{xeunicode} {
+	pg_wchar c = strtoul(yytext+2, NULL, 16);
+
+	if (c < 0xDC00 || c >= 0xE000)
+		yyerror("invalid Unicode surrogate pair");
+
+	c = (c & 0x3FF) + yyextra->utf16_top_part;
+
+	addunicode(c, yyscanner);
+
+	BEGIN(xe);
+}
+.			|
+\n			|
+<>	{ yyerror("invalid Unicode surrogate pair"); }
+
+{xeunicodebad}	{
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
+ errmsg("invalid Unicode escape"),
+ errhint("Unicode escapes must be full-length: \\u or \\U."),
+ lexer_errposition()));
+	}
+
 {xeescape}  {
 	if (yytext[1] == '\'')
 	{
@@ -1263,3 +1314,21 @@ base_yyfree(void *ptr, base_yyscan_t yyscanner)
 	if (ptr)
 		pfree(ptr);
 }
+
+static void
+addunicode(pg_wchar c, base_yyscan_t yyscanner)
+{
+	char buf[8];
+
+	if (c == 0 || c > 0x10)
+		yyerror("invalid Unicode escape value");
+	if (c > 0x7F)
+	{
+		if (GetDatabaseEncoding() != PG_UTF8)
+			yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
+		yyextra->saw_non_ascii = true;
+	}
+	unicode_to_utf8(c, (unsigned char *)buf);
+	addlit(buf, pg_mblen(buf), yyscanner);
+}
+
diff --git a/src/include/parser/gramparse.h b/src/include/parser/gramparse.h
index a54a1b1..0ef9bf4 100644
--- a/src/include/parser/gramparse.h
+++ b/src/include/parser/gramparse.h
@@ -71,6 +71,9 @@ typedef stru

Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-19 Thread Tom Lane
Marko Kreen  writes:
> On 4/18/09, Tom Lane  wrote:
>> The point has come up before, and I kinda thought we *had* changed the
>> lexer to reject \000.  I see we haven't though.  Curiously, this
>> does fail:
>> 
>> regression=# select U&'abc\xyz';
>> ERROR:  invalid byte sequence for encoding "SQL_ASCII": 0x00

> I think that's because out verifier actually *does* reject \0,
> only problem is that \0 does not set saw_high_bit flag,
> so the verifier simply does not get executed.
> But U& executes it always.

I fixed this in HEAD.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-18 Thread Marko Kreen
On 4/18/09, Tom Lane  wrote:
> Sam Mason  writes:
>  > On Fri, Apr 17, 2009 at 07:01:47PM +0200, Martijn van Oosterhout wrote:
>  >> On Fri, Apr 17, 2009 at 07:07:31PM +0300, Marko Kreen wrote:
>  >>> Btw, is there any good reason why we don't reject \000, \x00
>  >>> in text strings?
>  >>
>  >> Why forbid nulls in text strings?
>
>  > As far as I know, PG assumes, like most C code, that strings don't
>  > contain embedded NUL characters.
>
>
> Yeah; we should reject them because nothing will behave very sensibly
>  with them, eg
>
>  regression=# select E'abc\000xyz';
>   ?column?
>  --
>   abc
>  (1 row)
>
>  The point has come up before, and I kinda thought we *had* changed the
>  lexer to reject \000.  I see we haven't though.  Curiously, this
>  does fail:
>
>  regression=# select U&'abc\xyz';
>  ERROR:  invalid byte sequence for encoding "SQL_ASCII": 0x00
>  HINT:  This error can also happen if the byte sequence does not match the 
> encoding expected by the server, which is controlled by "client_encoding".
>
>  though that's not quite the message I'd have expected to see.

I think that's because out verifier actually *does* reject \0,
only problem is that \0 does not set saw_high_bit flag,
so the verifier simply does not get executed.
But U& executes it always.

unicode=# SELECT e'\xc3\xa4';
 ?column?
--
 ä
(1 row)

unicode=# SELECT e'\xc3\xa4\x00';
ERROR:  invalid byte sequence for encoding "UTF8": 0x00
HINT:  This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".

Heh.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Kevin Grittner
Tom Lane  wrote:
 
> The lexer is *not* allowed to invoke any database operations
> (such as pg_conversion lookups)
 
I certainly hope it's not!
 
> so it cannot perform arbitrary encoding conversions.
 
I was more questioning whether we should be looking at character
encodings at all at that point, rather than suggesting conversions
between different ones.  If committing the escape sequence to a
particular encoding is unavoidable at that point, then I suppose the
code in question is about as good as it gets.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Marko Kreen
On 4/18/09, Tom Lane  wrote:
> "Kevin Grittner"  writes:
>  > Andrew Dunstan  wrote:
>  >> ISTM that one of the uses of this is to say "store the character
>  >> that corresponds to this Unicode code point in whatever the database
>  >> encoding is"
>
>  > I would think you're right.  As long as the given character is in the
>  > user's character set, we should allow it.  Presumably we've already
>  > confirmed that they have an encoding scheme which allows them to store
>  > everything in their character set.
>
>
> This is a good way to get your patch rejected altogether.  The lexer
>  is *not* allowed to invoke any database operations (such as
>  pg_conversion lookups) so it cannot perform arbitrary encoding
>  conversions.

Ok.  I was just thinking that if such conversion can be provided easily,
it should be done.  But if not, then no need to make things complex.

Seems the proper way to look at it is that unicode escapes have
straightforward meaning only in UTF8 encoding.  So it should be
fine to limit them in other encodings to ascii.

>  If this sort of facility is what you want, the previously suggested
>  approach via a decode-like runtime function is a better fit.

I'm a UTF8-only kind on guy, so people who actually have experience
of using other encodings must comment on that one.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Tom Lane
Sam Mason  writes:
> On Fri, Apr 17, 2009 at 07:01:47PM +0200, Martijn van Oosterhout wrote:
>> On Fri, Apr 17, 2009 at 07:07:31PM +0300, Marko Kreen wrote:
>>> Btw, is there any good reason why we don't reject \000, \x00
>>> in text strings?
>> 
>> Why forbid nulls in text strings?

> As far as I know, PG assumes, like most C code, that strings don't
> contain embedded NUL characters.

Yeah; we should reject them because nothing will behave very sensibly
with them, eg

regression=# select E'abc\000xyz';
 ?column? 
--
 abc
(1 row)

The point has come up before, and I kinda thought we *had* changed the
lexer to reject \000.  I see we haven't though.  Curiously, this
does fail:

regression=# select U&'abc\xyz';
ERROR:  invalid byte sequence for encoding "SQL_ASCII": 0x00
HINT:  This error can also happen if the byte sequence does not match the 
encoding expected by the server, which is controlled by "client_encoding".

though that's not quite the message I'd have expected to see.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Tom Lane
"Kevin Grittner"  writes:
> Andrew Dunstan  wrote:
>> ISTM that one of the uses of this is to say "store the character
>> that corresponds to this Unicode code point in whatever the database
>> encoding is"
 
> I would think you're right.  As long as the given character is in the
> user's character set, we should allow it.  Presumably we've already
> confirmed that they have an encoding scheme which allows them to store
> everything in their character set.

This is a good way to get your patch rejected altogether.  The lexer
is *not* allowed to invoke any database operations (such as
pg_conversion lookups) so it cannot perform arbitrary encoding
conversions.

If this sort of facility is what you want, the previously suggested
approach via a decode-like runtime function is a better fit.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Andrew Dunstan



Marko Kreen wrote:

On 4/17/09, Kevin Grittner  wrote:
  

Andrew Dunstan  wrote:
 > ISTM that one of the uses of this is to say "store the character
 > that corresponds to this Unicode code point in whatever the database
 > encoding is"

I would think you're right.  As long as the given character is in the
 user's character set, we should allow it.  Presumably we've already
 confirmed that they have an encoding scheme which allows them to store
 everything in their character set.



It is probably good idea, but currently I just followed what the U&
strings do.

I can change my patch to do it, but it is probably more urgent in U&
case to decide whether they should work in other encodings too.

  


Indeed. What does the standard say about the behaviour of U&'' ?

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Marko Kreen
On 4/17/09, Kevin Grittner  wrote:
> Andrew Dunstan  wrote:
>  > ISTM that one of the uses of this is to say "store the character
>  > that corresponds to this Unicode code point in whatever the database
>  > encoding is"
>
> I would think you're right.  As long as the given character is in the
>  user's character set, we should allow it.  Presumably we've already
>  confirmed that they have an encoding scheme which allows them to store
>  everything in their character set.

It is probably good idea, but currently I just followed what the U&
strings do.

I can change my patch to do it, but it is probably more urgent in U&
case to decide whether they should work in other encodings too.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Kevin Grittner
Andrew Dunstan  wrote:
 
> ISTM that one of the uses of this is to say "store the character
> that corresponds to this Unicode code point in whatever the database
> encoding is"
 
I would think you're right.  As long as the given character is in the
user's character set, we should allow it.  Presumably we've already
confirmed that they have an encoding scheme which allows them to store
everything in their character set.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Andrew Dunstan



Marko Kreen wrote:

+   if (c > 0x7F)
+   {
+   if (GetDatabaseEncoding() != PG_UTF8)
+   yyerror("Unicode escape values cannot be used for code point 
values above 007F when the server encoding is not UTF8");
+   saw_high_bit = true;
+   }
  


Is that really what we want to do? ISTM that one of the uses of this is 
to say "store the character that corresponds to this Unicode code point 
in whatever the database encoding is", so that \u00a9 would become an 
encoding independent way of designating the copyright symbol, for instance.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Sam Mason
On Fri, Apr 17, 2009 at 07:01:47PM +0200, Martijn van Oosterhout wrote:
> On Fri, Apr 17, 2009 at 07:07:31PM +0300, Marko Kreen wrote:
> > Btw, is there any good reason why we don't reject \000, \x00
> > in text strings?
> 
> Why forbid nulls in text strings?

As far as I know, PG assumes, like most C code, that strings don't
contain embedded NUL characters.  The manual[1] has this to says:

  The character with the code zero cannot be in a string constant.

I believe you're supposed to use values of type "bytea" when you're
expecting to deal with NUL characters.

-- 
  Sam  http://samason.me.uk/
 
 [1] 
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Martijn van Oosterhout
On Fri, Apr 17, 2009 at 07:07:31PM +0300, Marko Kreen wrote:
> Btw, is there any good reason why we don't reject \000, \x00
> in text strings?

Why forbid nulls in text strings?

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while 
> boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-17 Thread Marko Kreen
On 4/16/09, Marko Kreen  wrote:
> It's up to UTF8 validator whether to consider non-characters as error.

I checked, and it did not work well, as addunicode() did not set
the saw_high_bit variable.when outputting UTF8.  Attached patch fixes it.

Currently is would be NOP as pg_verifymbstr() only checks for invalid UTF8,
and addunicode cannot output it, but in the future we may want to reject
some codes, so now it can.

Btw, is there any good reason why we don't reject \000, \x00
in text strings?

Currently I made addunicode() do it, because it seems sensible.

-- 
marko
diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index a559d75..fdb0cc5 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -394,6 +394,14 @@ SELECT 'foo'  'bar';
 
 hexadecimal byte value

+   
+
+ \u,
+ \U
+ (x = 0 - 9, A - F)
+
+16 or 32-bit hexadecimal Unicode character value.
+   
   
   
  
@@ -407,6 +415,14 @@ SELECT 'foo'  'bar';
 
 
 
+	 The Unicode escape syntax works fully only when the server encoding is UTF8.
+	 When other server encodings are used, only code points in the ASCII range
+	 (up to \u007F) can be specified.  Both \u and \U
+	 can also be used to specify UTF-16 surrogate pair to escape characters
+	 with value larger than \u.
+	
+
+
  It is your responsibility that the byte sequences you create are
  valid characters in the server character set encoding.  When the
  server encoding is UTF-8, then the alternative Unicode escape
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index a070e85..992cc9a 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -98,6 +98,11 @@ static char *scanbuf;
 
 static unsigned char unescape_single_char(unsigned char c);
 
+/* first part of unicode surrogate */
+static unsigned long xeu_surrogate1;
+
+static void addunicode(pg_wchar c);
+
 %}
 
 %option 8bit
@@ -128,6 +133,7 @@ static unsigned char unescape_single_char(unsigned char c);
  *   $foo$ quoted strings
  *   quoted identifier with Unicode escapes
  *   quoted string with Unicode escapes
+ *   Unicode surrogate escape in extended string
  */
 
 %x xb
@@ -139,6 +145,7 @@ static unsigned char unescape_single_char(unsigned char c);
 %x xdolq
 %x xui
 %x xus
+%x xeu
 
 /*
  * In order to make the world safe for Windows and Mac clients as well as
@@ -217,6 +224,7 @@ xeinside		[^\\']+
 xeescape		[\\][^0-7]
 xeoctesc		[\\][0-7]{1,3}
 xehexesc		[\\]x[0-9A-Fa-f]{1,2}
+xeunicode		[\\](u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})
 
 /* Extended quote
  * xqdouble implements embedded quote, 
@@ -506,6 +514,37 @@ other			.
 {xeinside}  {
 	addlit(yytext, yyleng);
 }
+{xeunicode} {
+	pg_wchar c = strtoul(yytext+2, NULL, 16);
+
+	check_escape_warning();
+
+	if (c >= 0xD800 && c < 0xDC00)
+	{
+		xeu_surrogate1 = c;
+		BEGIN(xeu);
+	}
+	else if (c >= 0xDC00 && c < 0xE000)
+		yyerror("invalid Unicode escape value");
+
+	addunicode(c);
+}
+{xeunicode} {
+	pg_wchar c = strtoul(yytext+2, NULL, 16);
+
+	if (c < 0xDC00 || c >= 0xE000)
+		yyerror("invalid Unicode surrogate pair");
+
+	c = ((xeu_surrogate1 & 0x3FF) << 10) | (c & 0x3FF);
+
+	addunicode(c + 0x1);
+
+	BEGIN(xe);
+}
+.			{
+	yyerror("invalid Unicode surrogate pair");
+}
+
 {xeescape}  {
 	if (yytext[1] == '\'')
 	{
@@ -1153,3 +1192,21 @@ check_escape_warning(void)
  lexer_errposition()));
 	warn_on_first_escape = false;	/* warn only once per string */
 }
+
+static void
+addunicode(pg_wchar c)
+{
+	char buf[8];
+
+	if (c == 0 || c > 0x10)
+		yyerror("invalid Unicode escape value");
+	if (c > 0x7F)
+	{
+		if (GetDatabaseEncoding() != PG_UTF8)
+			yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
+		saw_high_bit = true;
+	}
+	unicode_to_utf8(c, (unsigned char *)buf);
+	addlit(buf, pg_mblen(buf));
+}
+

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-16 Thread Marko Kreen
On 4/16/09, Sam Mason  wrote:
> On Thu, Apr 16, 2009 at 08:48:58PM +0300, Marko Kreen wrote:
>  > Seems I'm bad at communicating in english,
>
>
> I hope you're not saying this because of my misunderstandings!
>
>
>  > so here is C variant of
>  > my proposal to bring \u escaping into extended strings.  Reasons:
>  >
>  > - More people are familiar with \u escaping, as it's standard
>  >   in Java/C#/Python, probably more..
>  > - U& strings will not work when stdstr=off.
>  >
>  > Syntax:
>  >
>  >   \u  - 16-bit value
>  >   \U  - 32-bit value
>  >
>  > Additionally, both \u and \U can be used to specify UTF-16 surrogate
>  > pairs to encode characters with value > 0x.  This is exact behaviour
>  > used by Java/C#/Python.  (except that Java does not have \U)
>
>
> Are you sure that this handling of surrogates is correct?  The best
>  answer I've managed to find on the Unicode consortium's site is:
>
>   http://unicode.org/faq/utf_bom.html#utf16-7
>
>  it says:
>
>   They are invalid in interchange, but may be freely used internal to an
>   implementation.
>
>  I think this means they consider the handling of them you noted above,
>  in other languages, to be an error.

It's up to UTF8 validator whether to consider non-characters as error.

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-16 Thread Sam Mason
On Thu, Apr 16, 2009 at 03:04:37PM -0400, Andrew Dunstan wrote:
> Sam Mason wrote:
> >Are you sure that this handling of surrogates is correct?  The best
> >answer I've managed to find on the Unicode consortium's site is:
> >
> >  http://unicode.org/faq/utf_bom.html#utf16-7
> >
> >it says:
> >
> >  They are invalid in interchange, but may be freely used internal to an
> >  implementation.
> 
> It says that about non-characters, not about the use of surrogate pairs, 
> unless I am misreading it.

No, I think you're probably right and I was misreading it.  I went
back and forth several times to explicitly check I was interpreting
this correctly and still failed to get it right.  Not sure what I was
thinking and sorry for the hassle Marko!

I've already asked on the Unicode list about this (no response yet), but
I have a feeling I'm getting worked up over nothing.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-16 Thread Andrew Dunstan



Sam Mason wrote:

Are you sure that this handling of surrogates is correct?  The best
answer I've managed to find on the Unicode consortium's site is:

  http://unicode.org/faq/utf_bom.html#utf16-7

it says:

  They are invalid in interchange, but may be freely used internal to an
  implementation.
  




It says that about non-characters, not about the use of surrogate pairs, 
unless I am misreading it.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [rfc] unicode escapes for extended strings

2009-04-16 Thread Sam Mason
On Thu, Apr 16, 2009 at 08:48:58PM +0300, Marko Kreen wrote:
> Seems I'm bad at communicating in english,

I hope you're not saying this because of my misunderstandings!

> so here is C variant of
> my proposal to bring \u escaping into extended strings.  Reasons:
> 
> - More people are familiar with \u escaping, as it's standard
>   in Java/C#/Python, probably more..
> - U& strings will not work when stdstr=off.
> 
> Syntax:
> 
>   \u  - 16-bit value
>   \U  - 32-bit value
> 
> Additionally, both \u and \U can be used to specify UTF-16 surrogate
> pairs to encode characters with value > 0x.  This is exact behaviour
> used by Java/C#/Python.  (except that Java does not have \U)

Are you sure that this handling of surrogates is correct?  The best
answer I've managed to find on the Unicode consortium's site is:

  http://unicode.org/faq/utf_bom.html#utf16-7

it says:

  They are invalid in interchange, but may be freely used internal to an
  implementation.

I think this means they consider the handling of them you noted above,
in other languages, to be an error.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [rfc] unicode escapes for extended strings

2009-04-16 Thread Marko Kreen
Seems I'm bad at communicating in english, so here is C variant of
my proposal to bring \u escaping into extended strings.  Reasons:

- More people are familiar with \u escaping, as it's standard
  in Java/C#/Python, probably more..
- U& strings will not work when stdstr=off.

Syntax:

  \u  - 16-bit value
  \U  - 32-bit value

Additionally, both \u and \U can be used to specify UTF-16 surrogate
pairs to encode characters with value > 0x.  This is exact behaviour
used by Java/C#/Python.  (except that Java does not have \U)


I'm ok with this patch left to 8.5.

-- 
marko
diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index a559d75..fdb0cc5 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -394,6 +394,14 @@ SELECT 'foo'  'bar';
 
 hexadecimal byte value

+   
+
+ \u,
+ \U
+ (x = 0 - 9, A - F)
+
+16 or 32-bit hexadecimal Unicode character value.
+   
   
   
  
@@ -407,6 +415,14 @@ SELECT 'foo'  'bar';
 
 
 
+	 The Unicode escape syntax works fully only when the server encoding is UTF8.
+	 When other server encodings are used, only code points in the ASCII range
+	 (up to \u007F) can be specified.  Both \u and \U
+	 can also be used to specify UTF-16 surrogate pair to escape characters
+	 with value larger than \u.
+	
+
+
  It is your responsibility that the byte sequences you create are
  valid characters in the server character set encoding.  When the
  server encoding is UTF-8, then the alternative Unicode escape
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index a070e85..c0695f1 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -98,6 +98,11 @@ static char *scanbuf;
 
 static unsigned char unescape_single_char(unsigned char c);
 
+/* first part of unicode surrogate */
+static unsigned long xeu_surrogate1;
+
+static void addunicode(pg_wchar c);
+
 %}
 
 %option 8bit
@@ -128,6 +133,7 @@ static unsigned char unescape_single_char(unsigned char c);
  *   $foo$ quoted strings
  *   quoted identifier with Unicode escapes
  *   quoted string with Unicode escapes
+ *   Unicode surrogate escape in extended string
  */
 
 %x xb
@@ -139,6 +145,7 @@ static unsigned char unescape_single_char(unsigned char c);
 %x xdolq
 %x xui
 %x xus
+%x xeu
 
 /*
  * In order to make the world safe for Windows and Mac clients as well as
@@ -217,6 +224,7 @@ xeinside		[^\\']+
 xeescape		[\\][^0-7]
 xeoctesc		[\\][0-7]{1,3}
 xehexesc		[\\]x[0-9A-Fa-f]{1,2}
+xeunicode		[\\](u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})
 
 /* Extended quote
  * xqdouble implements embedded quote, 
@@ -506,6 +514,37 @@ other			.
 {xeinside}  {
 	addlit(yytext, yyleng);
 }
+{xeunicode} {
+	pg_wchar c = strtoul(yytext+2, NULL, 16);
+
+	check_escape_warning();
+
+	if (c >= 0xD800 && c < 0xDC00)
+	{
+		xeu_surrogate1 = c;
+		BEGIN(xeu);
+	}
+	else if (c >= 0xDC00 && c < 0xE000)
+		yyerror("invalid Unicode escape value");
+
+	addunicode(c);
+}
+{xeunicode} {
+	pg_wchar c = strtoul(yytext+2, NULL, 16);
+
+	if (c < 0xDC00 || c >= 0xE000)
+		yyerror("invalid Unicode surrogate pair");
+
+	c = ((xeu_surrogate1 & 0x3FF) << 10) | (c & 0x3FF);
+
+	addunicode(c + 0x1);
+
+	BEGIN(xe);
+}
+.			{
+	yyerror("invalid Unicode surrogate pair");
+}
+
 {xeescape}  {
 	if (yytext[1] == '\'')
 	{
@@ -1153,3 +1192,18 @@ check_escape_warning(void)
  lexer_errposition()));
 	warn_on_first_escape = false;	/* warn only once per string */
 }
+
+static void
+addunicode(pg_wchar c)
+{
+	char buf[8];
+
+	if (c == 0 || c > 0x10)
+		yyerror("invalid Unicode escape value");
+	if (c > 0x7F && GetDatabaseEncoding() != PG_UTF8)
+		yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
+
+	unicode_to_utf8(c, (unsigned char *)buf);
+	addlit(buf, pg_mblen(buf));
+}
+

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers