Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-02-02 Thread Robert Haas
On Sat, Jan 31, 2015 at 8:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I understand Andrew to be saying that if you take a 6-character string
 and convert it to a JSON string and then back to text, you will
 *usually* get back the same 6 characters you started with ... unless
 the first character was \, the second u, and the remainder hexadecimal
 digits.  Then you'll get back a one-character string or an error
 instead.  It's not hard to imagine that leading to surprising
 behavior, or even security vulnerabilities in applications that aren't
 expecting such a translation to happen under them.

 That *was* the case, with the now-reverted patch that changed the escaping
 rules.  It's not anymore:

 regression=# select to_json('\u1234'::text);
   to_json
 -----------
  "\\u1234"
 (1 row)

 When you convert that back to text, you'll get \u1234, no more and no
 less.  For example:

 regression=# select array_to_json(array['\u1234'::text]);
  array_to_json
 ---------------
  ["\\u1234"]
 (1 row)

 regression=# select array_to_json(array['\u1234'::text])->0;
  ?column?
 -----------
  "\\u1234"
 (1 row)

 regression=# select array_to_json(array['\u1234'::text])->>0;
  ?column?
 --
  \u1234
 (1 row)

 Now, if you put in '"\u1234"'::jsonb and extract that string as text,
 you get some Unicode character or other.  But I'd say that a JSON user
 who is surprised by that doesn't understand JSON, and definitely that they
 hadn't read more than about one paragraph of our description of the JSON
 types.

Totally agree.  That's why I think reverting the patch was the right
thing to do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-31 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I understand Andrew to be saying that if you take a 6-character string
 and convert it to a JSON string and then back to text, you will
 *usually* get back the same 6 characters you started with ... unless
 the first character was \, the second u, and the remainder hexadecimal
 digits.  Then you'll get back a one-character string or an error
 instead.  It's not hard to imagine that leading to surprising
 behavior, or even security vulnerabilities in applications that aren't
 expecting such a translation to happen under them.

That *was* the case, with the now-reverted patch that changed the escaping
rules.  It's not anymore:

regression=# select to_json('\u1234'::text);
  to_json
-----------
 "\\u1234"
(1 row)

When you convert that back to text, you'll get \u1234, no more and no
less.  For example:

regression=# select array_to_json(array['\u1234'::text]);
 array_to_json
---------------
 ["\\u1234"]
(1 row)

regression=# select array_to_json(array['\u1234'::text])->0;
 ?column?
-----------
 "\\u1234"
(1 row)

regression=# select array_to_json(array['\u1234'::text])->>0;
 ?column? 
--
 \u1234
(1 row)

Now, if you put in '"\u1234"'::jsonb and extract that string as text,
you get some Unicode character or other.  But I'd say that a JSON user
who is surprised by that doesn't understand JSON, and definitely that they
hadn't read more than about one paragraph of our description of the JSON
types.
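The round-trip behavior Tom demonstrates follows from standard JSON escaping rules, and can be reproduced with any conforming JSON library. A minimal sketch using Python's json module (an illustration of the escaping rules only, not PostgreSQL's code path):

```python
import json

# A 6-character text value that merely *looks like* a Unicode escape:
# the characters \ u 1 2 3 4.
s = "\\u1234"

# Correct JSON encoding escapes the backslash, as to_json() does.
encoded = json.dumps(s)
assert encoded == '"\\\\u1234"'

# Decoding is lossless: the same six characters come back.
assert json.loads(encoded) == s

# A real \u escape in JSON source, by contrast, denotes one character.
assert json.loads('"\\u1234"') == "\u1234"
assert len(json.loads('"\\u1234"')) == 1
```

The surprise scenario in the quoted paragraph only arises if the backslash is *not* escaped during encoding, which is what the reverted patch did.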

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-30 Thread Peter Geoghegan
On Thu, Jan 29, 2015 at 11:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The point of JSONB is that we take a position on certain aspects like
 this. We're bridging a pointedly loosey goosey interchange format,
 JSON, with native PostgreSQL types. For example, we take a firm
 position on encoding. The JSON type is a bit more permissive, to about
 the extent that that's possible. The whole point is that we're
 interpreting JSON data in a way that's consistent with *Postgres*
 conventions. You'd have to interpret the data according to *some*
 convention in order to do something non-trivial with it in any case,
 and users usually want that.

 I quite agree with you, actually, in terms of that perspective.

Sure, but I wasn't sure that that was evident to others.

To emphasize: I think it's appropriate that the JSON spec takes
somewhat of a back seat approach to things like encoding and the
precision of numbers. I also think it's appropriate that JSONB does
not, up to and including where JSONB forbids things that the JSON spec
supposes could be useful. We haven't failed users by (say) not
accepting NULs, even though the spec suggests that that might be
useful - we have provided them with a reasonable, concrete
interpretation of that JSON data, with lots of useful operators, that
they may take or leave. It really isn't historical that we have both a
JSON and JSONB type. For other examples of this, see every document
database in existence.

Depart from this perspective, as an interchange standard author, and
you end up with something like XML, which while easy to reason about
isn't all that useful, or BSON, the binary interchange format, which
is an oxymoron.

 But my point remains: \u0000 is not invalid JSON syntax, and neither is
 \u1234.  If we choose to throw an error because we can't interpret or
 process that according to our conventions, fine, but we should call it
 something other than invalid syntax.

 ERRCODE_UNTRANSLATABLE_CHARACTER or ERRCODE_CHARACTER_NOT_IN_REPERTOIRE
 seem more apropos from here.

I see. I'd go with ERRCODE_UNTRANSLATABLE_CHARACTER, then.
-- 
Peter Geoghegan




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Robert Haas
On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
 which I'm inclined to think we need to simply revert, not render even
 more squirrely.

Yes, that commit looks broken.  If you convert from text to JSON you
should get a JSON string equivalent to the text you started out with.
That commit departs from that in favor of allowing \u sequences in
the text being converted to turn into a single character (or perhaps
an encoding error) after a text->json->text roundtrip.  Maybe I
haven't had enough caffeine today, but I can't see how that can
possibly be a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Noah Misch
On Wed, Jan 28, 2015 at 12:48:45PM -0500, Tom Lane wrote:
 Noah Misch n...@leadboat.com writes:
  On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
  So at this point I propose that we reject \u0000 when de-escaping JSON.
 
  I would have agreed on 2014-12-09, and this release is the last chance
  to make such a change.  It is a bold wager that could pay off, but -1
  from me anyway.
 
 You only get to vote -1 if you have a credible alternative.  I don't see
 one.

I don't love json enough to keep participating in a thread where you dismiss
patches and comments from Andrew and myself as "quite useless", "utter
hogwash" and "non-credible".

nm




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Bruce Momjian
On Wed, Jan 28, 2015 at 03:13:47PM -0600, Jim Nasby wrote:
 While I sympathize with Noah's sentiments, the only thing that makes
 sense to me is that a JSON text field is treated the same way as we
 treat text. Right now, that means NUL is not allowed, period.

+1

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Andrew Dunstan


On 01/29/2015 04:59 PM, Robert Haas wrote:

On Thu, Jan 29, 2015 at 4:33 PM, Andrew Dunstan and...@dunslane.net wrote:

On 01/29/2015 12:10 PM, Robert Haas wrote:

On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:

The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
which I'm inclined to think we need to simply revert, not render even
more squirrely.

Yes, that commit looks broken.  If you convert from text to JSON you
should get a JSON string equivalent to the text you started out with.
That commit departs from that in favor of allowing \u sequences in
the text being converted to turn into a single character (or perhaps
an encoding error) after a text->json->text roundtrip.  Maybe I
haven't had enough caffeine today, but I can't see how that can
possibly be a good idea.

Possibly.

I'm coming down more and more on the side of Tom's suggestion just to ban
\u0000 in jsonb. I think that would let us have some fairly simple and
consistent rules. I'm not too worried that we'll be disallowing input that
we've previously allowed. We've done that often in the past, although less
often in point releases. I certainly don't want to wait a full release cycle
to fix this if possible.

I have yet to understand what we fix by banning \u0000.  How is \u0000
different from any other four-digit hexadecimal number that's not a
valid character in the current encoding?  What does banning that one
particular value do?

In any case, whatever we do about that issue, the idea that the text
-> json string transformation can *change the input string into some
other string* seems like an independent problem.



jsonb stores string values as postgres text values, with the unicode 
escapes resolved, just as it also resolves numbers and booleans into 
their native representation.  If you want the input perfectly preserved, 
use json, not jsonb.  I think that's made pretty clear in the docs.


so text->jsonb->text is not and has never been expected to be a noop.
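The json/jsonb distinction Andrew describes can be sketched in miniature with Python's json module (the two variables below are hypothetical stand-ins for the two storage strategies, not PostgreSQL internals):

```python
import json

raw = '["\\u0041"]'        # JSON source text containing a \u escape

# json-like storage: keep the input text verbatim.
stored_json = raw

# jsonb-like storage: parse to native values; escapes are resolved.
stored_jsonb = json.loads(raw)
assert stored_jsonb == ["A"]

# Re-serializing the resolved form does not reproduce the input text,
# which is why the round trip through the binary form is not a no-op.
assert json.dumps(stored_jsonb) == '["A"]'
assert json.dumps(stored_jsonb) != raw
```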

cheers

andrew




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Robert Haas
On Thu, Jan 29, 2015 at 4:33 PM, Andrew Dunstan and...@dunslane.net wrote:
 On 01/29/2015 12:10 PM, Robert Haas wrote:
 On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
 which I'm inclined to think we need to simply revert, not render even
 more squirrely.
 Yes, that commit looks broken.  If you convert from text to JSON you
 should get a JSON string equivalent to the text you started out with.
 That commit departs from that in favor of allowing \u sequences in
 the text being converted to turn into a single character (or perhaps
 an encoding error) after a text->json->text roundtrip.  Maybe I
 haven't had enough caffeine today, but I can't see how that can
 possibly be a good idea.

 Possibly.

 I'm coming down more and more on the side of Tom's suggestion just to ban
 \u0000 in jsonb. I think that would let us have some fairly simple and
 consistent rules. I'm not too worried that we'll be disallowing input that
 we've previously allowed. We've done that often in the past, although less
 often in point releases. I certainly don't want to wait a full release cycle
 to fix this if possible.

I have yet to understand what we fix by banning \u0000.  How is \u0000
different from any other four-digit hexadecimal number that's not a
valid character in the current encoding?  What does banning that one
particular value do?

In any case, whatever we do about that issue, the idea that the text
-> json string transformation can *change the input string into some
other string* seems like an independent problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Andrew Dunstan


On 01/29/2015 12:10 PM, Robert Haas wrote:

On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:

The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
which I'm inclined to think we need to simply revert, not render even
more squirrely.

Yes, that commit looks broken.  If you convert from text to JSON you
should get a JSON string equivalent to the text you started out with.
That commit departs from that in favor of allowing \u sequences in
the text being converted to turn into a single character (or perhaps
an encoding error) after a text->json->text roundtrip.  Maybe I
haven't had enough caffeine today, but I can't see how that can
possibly be a good idea.




Possibly.

I'm coming down more and more on the side of Tom's suggestion just to 
ban \u0000 in jsonb. I think that would let us have some fairly simple 
and consistent rules. I'm not too worried that we'll be disallowing 
input that we've previously allowed. We've done that often in the past, 
although less often in point releases. I certainly don't want to wait a 
full release cycle to fix this if possible.


cheers

andrew






Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Andres Freund
On 2015-01-29 16:33:36 -0500, Andrew Dunstan wrote:
 
 On 01/29/2015 12:10 PM, Robert Haas wrote:
 On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
 which I'm inclined to think we need to simply revert, not render even
 more squirrely.
 Yes, that commit looks broken.  If you convert from text to JSON you
 should get a JSON string equivalent to the text you started out with.
 That commit departs from that in favor of allowing \u sequences in
 the text being converted to turn into a single character (or perhaps
 an encoding error) after a text->json->text roundtrip.  Maybe I
 haven't had enough caffeine today, but I can't see how that can
 possibly be a good idea.

 I'm coming down more and more on the side of Tom's suggestion just to ban
 \u0000 in jsonb. I think that would let us have some fairly simple and
 consistent rules. I'm not too worried that we'll be disallowing input that
 we've previously allowed. We've done that often in the past, although less
 often in point releases. I certainly don't want to wait a full release cycle
 to fix this if possible.

I think waiting a full cycle would be likely to make things much worse,
given 9.4's current age.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Jan 29, 2015 at 4:33 PM, Andrew Dunstan and...@dunslane.net wrote:
 I'm coming down more and more on the side of Tom's suggestion just to ban
 \u0000 in jsonb.

 I have yet to understand what we fix by banning \u0000.  How is \u0000
 different from any other four-digit hexadecimal number that's not a
 valid character in the current encoding?  What does banning that one
 particular value do?

As Andrew pointed out upthread, it avoids having to answer the question of
what to return for

select (jsonb '["foo\u0000bar"]')->>0;

or any other construct which is supposed to return an *unescaped* text
representation of some JSON string value.

Right now you get

   ?column?
--------------
 foo\u0000bar
(1 row)

Which is wrong IMO, first because it violates the premise that the output
should be unescaped, and second because this output cannot be
distinguished from the (correct) output of

regression=# select (jsonb '["foo\\u0000bar"]')->>0;
   ?column?
--------------
 foo\u0000bar
(1 row)

There is no way to deliver an output that is not confusable with some
other value's correct output, other than by emitting a genuine \0 byte
which unfortunately we cannot support in a TEXT result.

Potential solutions for this have been mooted upthread, but none
of them look like they're something we can do in the very short run.
So the proposal is to ban \u0000 until such time as we can do something
sane with it.
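Tom's ambiguity argument rests on plain JSON semantics and can be restated outside SQL; a sketch in Python (an illustration, not the PostgreSQL implementation):

```python
import json

# "foo\u0000bar": unescaping yields a real NUL character.
v1 = json.loads('"foo\\u0000bar"')
assert v1 == "foo\x00bar" and len(v1) == 7

# "foo\\u0000bar": six literal characters \ u 0 0 0 0 in the middle.
v2 = json.loads('"foo\\\\u0000bar"')
assert v2 == "foo\\u0000bar" and len(v2) == 12

# A text type that cannot store NUL must render v1 in some escaped
# form, and any such rendering collides with the correct unescaped
# output of some other value -- here, v2's.
```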

 In any case, whatever we do about that issue, the idea that the text
 -> json string transformation can *change the input string into some
 other string* seems like an independent problem.

No, it's exactly the same problem, because the reason for that breakage
is an ill-advised attempt to make it safe to include \u0000 in JSONB.

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I have yet to understand what we fix by banning \u0000.  How is \u0000
 different from any other four-digit hexadecimal number that's not a
 valid character in the current encoding?  What does banning that one
 particular value do?

BTW, as to the point about encoding violations: we *already* ban \u
sequences that don't correspond to valid characters in the current
encoding.  The attempt to exclude U+0000 from the set of banned characters
was ill-advised, plain and simple.
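The existing restriction Tom refers to is easy to demonstrate in miniature: an escape can be syntactically valid JSON yet denote a character that a non-UTF8 server encoding cannot represent. A Python sketch, with LATIN1 standing in for a hypothetical non-UTF8 database encoding:

```python
import json

# Syntactically valid JSON: four hex digits follow \u.
ch = json.loads('"\\u05e2"')   # HEBREW LETTER AYIN
assert len(ch) == 1

# But the character has no representation in LATIN1, so a database
# using that encoding has no choice but to reject the escape when
# unescaping it into a native text value.
try:
    ch.encode("latin-1")
    representable = True
except UnicodeEncodeError:
    representable = False
assert not representable
```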

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Tom Lane
Attached is a draft patch for this.  It basically reverts commit
0ad1a816320a2b539a51628e2a0b1e83ff096b1d, adds a ban of \u0000 if
that would need to be converted to text (so it still works in the
plain json type, so long as you don't do much processing), and adds
some regression tests.

I made the \u0000 error be errcode(ERRCODE_INVALID_TEXT_REPRESENTATION)
and errmsg("invalid input syntax for type json"), by analogy to what's
thrown for non-ASCII Unicode escapes in non-UTF8 encoding.  I'm not
terribly happy with that, though.  ISTM that for both cases, this is
not invalid syntax at all, but an implementation restriction that
forces us to reject perfectly valid syntax.  So I think we ought to
use a different ERRCODE and text message, though I'm not entirely
sure what it should be instead.  ERRCODE_FEATURE_NOT_SUPPORTED is
one possibility.

regards, tom lane

diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml
index 8feb2fb..b4b97a7 100644
*** a/doc/src/sgml/json.sgml
--- b/doc/src/sgml/json.sgml
***
*** 69,80 
    regardless of the database encoding, and are checked only for syntactic
    correctness (that is, that four hex digits follow <literal>\u</>).
    However, the input function for type <type>jsonb</> is stricter: it disallows
!   Unicode escapes for non-ASCII characters (those
!   above <literal>U+007F</>) unless the database encoding is UTF8.  It also
!   insists that any use of Unicode surrogate pairs to designate characters
!   outside the Unicode Basic Multilingual Plane be correct.  Valid Unicode
!   escapes, except for <literal>\u0000</>, are then converted to the
!   equivalent ASCII or UTF8 character for storage.
   </para>
  
   <note>
--- 69,82 
    regardless of the database encoding, and are checked only for syntactic
    correctness (that is, that four hex digits follow <literal>\u</>).
    However, the input function for type <type>jsonb</> is stricter: it disallows
!   Unicode escapes for non-ASCII characters (those above <literal>U+007F</>)
!   unless the database encoding is UTF8.  The <type>jsonb</> type also
!   rejects <literal>\u0000</> (because that cannot be represented in
!   <productname>PostgreSQL</productname>'s <type>text</> type), and it insists
!   that any use of Unicode surrogate pairs to designate characters outside
!   the Unicode Basic Multilingual Plane be correct.  Valid Unicode escapes
!   are converted to the equivalent ASCII or UTF8 character for storage;
!   this includes folding surrogate pairs into a single character.
   </para>
  
   <note>
***
*** 134,140 
   <row>
    <entry><type>string</></entry>
    <entry><type>text</></entry>
!   <entry>See notes above concerning encoding restrictions</entry>
   </row>
   <row>
    <entry><type>number</></entry>
--- 136,143 
   <row>
    <entry><type>string</></entry>
    <entry><type>text</></entry>
!   <entry><literal>\u0000</> is disallowed, as are non-ASCII Unicode
!    escapes if database encoding is not UTF8</entry>
   </row>
   <row>
    <entry><type>number</></entry>
diff --git a/doc/src/sgml/release-9.4.sgml b/doc/src/sgml/release-9.4.sgml
index 961e461..11bbf3b 100644
*** a/doc/src/sgml/release-9.4.sgml
--- b/doc/src/sgml/release-9.4.sgml
***
*** 103,124 
  
  <listitem>
   <para>
-   Unicode escapes in <link linkend="datatype-json"><type>JSON</type></link>
-   text values are no longer rendered with the backslash escaped
-   (Andrew Dunstan)
-  </para>
- 
-  <para>
-   Previously, all backslashes in text values being formed into JSON
-   were escaped. Now a backslash followed by <literal>u</> and four
-   hexadecimal digits is not escaped, as this is a legal sequence in a
-   JSON string value, and escaping the backslash led to some perverse
-   results.
-  </para>
- </listitem>
- 
- <listitem>
-  <para>
    When converting values of type <type>date</>, <type>timestamp</>
    or <type>timestamptz</>
    to <link linkend="datatype-json"><type>JSON</type></link>, render the
--- 103,108 
diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 3c137ea..4e46b0a 100644
*** a/src/backend/utils/adt/json.c
--- b/src/backend/utils/adt/json.c
*** json_lex_string(JsonLexContext *lex)
*** 806,819 
  	 * For UTF8, replace the escape sequence by the actual
  	 * utf8 character in lex->strval. Do this also for other
  	 * encodings if the escape designates an ASCII character,
! 	 * otherwise raise an error. We don't ever unescape a
! 	 * \u0000, since that would result in an impermissible nul
! 	 * byte.
  	 */
  
  	if (ch == 0)
  	{
! 		appendStringInfoString(lex->strval, "\\u0000");
  	}
  	else if (GetDatabaseEncoding() == PG_UTF8)
  	{
--- 806,822 
  	 * For UTF8, replace the escape sequence by the actual
  	 * utf8 character in lex->strval. Do this also for other
  	 * encodings if the escape designates an ASCII character,
! 	 * 

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Peter Geoghegan
On Thu, Jan 29, 2015 at 10:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I made the \u0000 error be errcode(ERRCODE_INVALID_TEXT_REPRESENTATION)
 and errmsg("invalid input syntax for type json"), by analogy to what's
 thrown for non-ASCII Unicode escapes in non-UTF8 encoding.  I'm not
 terribly happy with that, though.  ISTM that for both cases, this is
 not invalid syntax at all, but an implementation restriction that
 forces us to reject perfectly valid syntax.  So I think we ought to
 use a different ERRCODE and text message, though I'm not entirely
 sure what it should be instead.  ERRCODE_FEATURE_NOT_SUPPORTED is
 one possibility.

I personally prefer what you have here.

The point of JSONB is that we take a position on certain aspects like
this. We're bridging a pointedly loosey goosey interchange format,
JSON, with native PostgreSQL types. For example, we take a firm
position on encoding. The JSON type is a bit more permissive, to about
the extent that that's possible. The whole point is that we're
interpreting JSON data in a way that's consistent with *Postgres*
conventions. You'd have to interpret the data according to *some*
convention in order to do something non-trivial with it in any case,
and users usually want that.

It's really nice the way encoding is a strict implementation detail
within Postgres in general, in the sense that you know that if your
application code is concerned about encoding, you're probably thinking
about the problem incorrectly (at least once data has crossed the
database encoding border). MySQL's lackadaisical attitudes here appear
to have been an enormous mistake.
-- 
Peter Geoghegan




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Tom Lane
Peter Geoghegan p...@heroku.com writes:
 I looked into it, and it turns out that MongoDB does not accept NUL in
 at least some contexts (for object keys). Apparently it wasn't always
 so. MongoDB previously had a security issue that was fixed by
 introducing this restriction. Their JSON-centric equivalent of
 per-column privileges was for a time compromised, because NUL
 injection was possible:

 https://www.idontplaydarts.com/2011/02/mongodb-null-byte-injection-attacks/

 It's easy to bash MongoDB, but this is still an interesting data
 point. They changed this after the fact, and yet I can find no
 evidence of any grumbling about it from end users. No one really
 noticed.

Hoo, that's interesting.  Lends some support to my half-baked idea that
we might disallow NUL in object keys even if we are able to allow it
elsewhere in JSON strings.
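The class of bug in the linked advisory can be sketched generically (a hypothetical check, not MongoDB's actual code): a NUL-terminated string comparison stops at the first NUL byte, so an attacker-supplied key with an embedded NUL can pass a check against a shorter trusted key.

```python
def nul_terminated_equal(a: bytes, b: bytes) -> bool:
    """Compare the way a naive C strcmp would: stop at the first NUL."""
    return a.split(b"\x00", 1)[0] == b.split(b"\x00", 1)[0]

# The full byte strings differ...
assert b"admin\x00attacker" != b"admin"
# ...but the NUL-terminated comparison cannot tell them apart.
assert nul_terminated_equal(b"admin\x00attacker", b"admin")
```

This is why allowing NUL into keys that downstream code handles as C strings is a security hazard, not just a representation nuisance.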

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Andrew Dunstan


On 01/29/2015 05:39 PM, Tom Lane wrote:

Robert Haas robertmh...@gmail.com writes:

I have yet to understand what we fix by banning \u0000.  How is \u0000
different from any other four-digit hexadecimal number that's not a
valid character in the current encoding?  What does banning that one
particular value do?

BTW, as to the point about encoding violations: we *already* ban \u
sequences that don't correspond to valid characters in the current
encoding.  The attempt to exclude U+0000 from the set of banned characters
was ill-advised, plain and simple.




Actually, unless the encoding is utf8 we ban all non-ascii unicode 
escapes even if they might designate a valid character in the current 
encoding. This was arrived at after some discussion here. So adding 
\u0000 to the list of banned characters is arguably just making us more 
consistent.



cheers

andrew




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Tom Lane
Peter Geoghegan p...@heroku.com writes:
 On Thu, Jan 29, 2015 at 10:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  I made the \u0000 error be errcode(ERRCODE_INVALID_TEXT_REPRESENTATION)
  and errmsg("invalid input syntax for type json"), by analogy to what's
 thrown for non-ASCII Unicode escapes in non-UTF8 encoding.  I'm not
 terribly happy with that, though.  ISTM that for both cases, this is
 not invalid syntax at all, but an implementation restriction that
 forces us to reject perfectly valid syntax.  So I think we ought to
 use a different ERRCODE and text message, though I'm not entirely
 sure what it should be instead.  ERRCODE_FEATURE_NOT_SUPPORTED is
 one possibility.

 I personally prefer what you have here.

 The point of JSONB is that we take a position on certain aspects like
 this. We're bridging a pointedly loosey goosey interchange format,
 JSON, with native PostgreSQL types. For example, we take a firm
 position on encoding. The JSON type is a bit more permissive, to about
 the extent that that's possible. The whole point is that we're
 interpreting JSON data in a way that's consistent with *Postgres*
 conventions. You'd have to interpret the data according to *some*
 convention in order to do something non-trivial with it in any case,
 and users usually want that.

I quite agree with you, actually, in terms of that perspective.  But my
point remains: \u0000 is not invalid JSON syntax, and neither is
\u1234.  If we choose to throw an error because we can't interpret or
process that according to our conventions, fine, but we should call it
something other than invalid syntax.

ERRCODE_UNTRANSLATABLE_CHARACTER or ERRCODE_CHARACTER_NOT_IN_REPERTOIRE
seem more apropos from here.  And I still think there's a case to be
made for ERRCODE_FEATURE_NOT_SUPPORTED, because it's at least possible
that we'd relax this restriction in future (eg, allow Unicode characters
that can be converted to the database's encoding).

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-28 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
 So at this point I propose that we reject \u0000 when de-escaping JSON.

 I would have agreed on 2014-12-09, and this release is the last chance to make
 such a change.  It is a bold wager that could pay off, but -1 from me anyway.

You only get to vote -1 if you have a credible alternative.  I don't see
one.

 I can already envision the blog post from the DBA staying on 9.4.0 because
 9.4.1 pulled his ability to store U+0000 in jsonb.

Those will be more or less the same people who bitch about text not
storing NULs; the world has not fallen.

 jsonb was *the* top-billed
 9.4 feature, and this thread started with Andrew conveying a field report of a
 scenario more obscure than storing U+0000.

There is a separate issue, which is that our earlier attempt to make the
world safe for \u0000 actually broke other things.  We still need to fix
that, but I think the fix probably consists of reverting that patch and
instead disallowing \u0000.

 Anybody who's seriously unhappy with that can propose a patch to fix it
 properly in 9.5 or later.

 Someone can still do that by introducing a V2 of the jsonb binary format and
 preserving the ability to read both formats.  (Too bad Andres's proposal to
 include a format version didn't inform the final format, but we can wing it.)
 I agree that storing U+0000 as 0x00 is the best end state.

We will not need a v2 format, merely values that contain NULs.  Existing
data containing \u0000 will be read as those six ASCII characters, which
is not really wrong since that might well be what it was anyway.

 We probably need to rethink the re-escaping behavior as well; I'm not
 sure if your latest patch is the right answer for that.

 Yes, we do.  No change to the representation of U+ is going to fix the
 following bug, but that patch does fix it:

  [local] test=# select
  test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb,
  test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb::text::jsonb;

The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
which I'm inclined to think we need to simply revert, not render even
more squirrely.

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-28 Thread Andrew Dunstan


On 01/28/2015 12:50 AM, Noah Misch wrote:

On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 01/27/2015 02:28 PM, Tom Lane wrote:

Well, we can either fix it now or suffer with a broken representation
forever.  I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.

So at this point I propose that we reject \u0000 when de-escaping JSON.

I would have agreed on 2014-12-09, and this release is the last chance to make
such a change.  It is a bold wager that could pay off, but -1 from me anyway.
I can already envision the blog post from the DBA staying on 9.4.0 because
9.4.1 pulled his ability to store U+0000 in jsonb.  jsonb was *the* top-billed
9.4 feature, and this thread started with Andrew conveying a field report of a
scenario more obscure than storing U+0000.  Therefore, we have to assume many
users will notice the change.  This move would also add to the growing
evidence that our .0 releases are really beta(N+1) releases in disguise.


Anybody who's seriously unhappy with that can propose a patch to fix it
properly in 9.5 or later.

Someone can still do that by introducing a V2 of the jsonb binary format and
preserving the ability to read both formats.  (Too bad Andres's proposal to
include a format version didn't inform the final format, but we can wing it.)
I agree that storing U+0000 as 0x00 is the best end state.





We need to make up our minds about this pretty quickly. The more radical 
move is likely to involve quite a bit of work, ISTM.


It's not clear to me how we should represent a unicode null. i.e. given 
a json of '["foo\u0000bar"]', I get that we'd store the element as 
'foo\x00bar', but what is the result of


   (jsonb '["foo\u0000bar"]') ->> 0

It's defined to be text so we can't just shove a binary null in the 
middle of it. Do we throw an error?


And I still want to hear more voices on the whole direction we want to 
take this.


cheers

andrew




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-28 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 It's not clear to me how we should represent a unicode null. i.e. given 
 a json of '["foo\u0000bar"]', I get that we'd store the element as 
 'foo\x00bar', but what is the result of

 (jsonb '["foo\u0000bar"]') ->> 0

 It's defined to be text so we can't just shove a binary null in the 
 middle of it. Do we throw an error?

Yes, that is what I was proposing upthread.  Obviously, this needs some
thought to ensure that there's *something* useful you can do with a field
containing a nul, but we'd have little choice but to throw an error if
the user asks us to convert such a field to unescaped text.
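To make that concrete, here is a rough Python sketch of the proposed rule (invented function names, Python's json module as a stand-in; this is not PostgreSQL code): escaped output keeps working because the NUL byte re-escapes to \u0000, while conversion to unescaped text raises an error.

```python
import json

# Rough sketch (invented names -- not PostgreSQL code) of the proposed
# rule: U+0000 is stored as a real NUL byte; escaped output still works,
# but conversion to unescaped text must raise an error.

def jsonb_field_escaped(value):
    # Escaped output (like jsonb's text representation): the NUL byte
    # re-escapes cleanly to \u0000.
    return json.dumps(value)

def jsonb_field_as_text(value):
    # Unescaped output (like the ->> operator): a NUL byte cannot appear
    # in a text value, so this must error out.
    if "\x00" in value:
        raise ValueError("cannot convert jsonb string containing \\u0000 to text")
    return value

stored = "foo\x00bar"   # '["foo\u0000bar"]' element 0, fully de-escaped
print(jsonb_field_escaped(stored))   # "foo\u0000bar"
```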

I'd be a bit inclined to reject nuls in object field names even if we
allow them in field values, since just about everything you can usefully
do with a field name involves regarding it as text.

Another interesting implementation problem is what does indexing do with
such values --- ISTR there's an implicit conversion to C strings in there
too, at least in GIN indexes.

Anyway, there is a significant amount of work involved here, and there's
no way we're getting it done for 9.4.1, or probably 9.4.anything.  I think
our only realistic choice right now is to throw error for \u0000 so that
we can preserve our options for doing something useful with it later.

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-28 Thread Tom Lane
Jim Nasby jim.na...@bluetreble.com writes:
 While I sympathize with Noah's sentiments, the only thing that makes sense to 
 me is that a JSON text field is treated the same way as we treat text. Right 
 now, that means NUL is not allowed, period.

 If no one has bitched about this with text, is it really that big a problem 
 with JSON?

Oh, people have bitched about it all right ;-).  But I think your real
question is how many applications find that restriction to be fatal.
The answer clearly is not a lot for text, and I don't see why the
answer would be different for JSON.

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-28 Thread Jim Nasby

On 1/28/15 11:36 AM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

It's not clear to me how we should represent a unicode null. i.e. given
a json of '["foo\u0000bar"]', I get that we'd store the element as
'foo\x00bar', but what is the result of



 (jsonb '["foo\u0000bar"]') ->> 0



It's defined to be text so we can't just shove a binary null in the
middle of it. Do we throw an error?


Yes, that is what I was proposing upthread.  Obviously, this needs some
thought to ensure that there's *something* useful you can do with a field
containing a nul, but we'd have little choice but to throw an error if
the user asks us to convert such a field to unescaped text.

I'd be a bit inclined to reject nuls in object field names even if we
allow them in field values, since just about everything you can usefully
do with a field name involves regarding it as text.

Another interesting implementation problem is what does indexing do with
such values --- ISTR there's an implicit conversion to C strings in there
too, at least in GIN indexes.

Anyway, there is a significant amount of work involved here, and there's
no way we're getting it done for 9.4.1, or probably 9.4.anything.  I think
our only realistic choice right now is to throw error for \u0000 so that
we can preserve our options for doing something useful with it later.


My $0.01:

While I sympathize with Noah's sentiments, the only thing that makes sense to 
me is that a JSON text field is treated the same way as we treat text. Right 
now, that means NUL is not allowed, period.

If no one has bitched about this with text, is it really that big a problem 
with JSON?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-28 Thread Jeff Janes
On Wed, Jan 28, 2015 at 1:13 PM, Jim Nasby jim.na...@bluetreble.com wrote:


 My $0.01:

 While I sympathize with Noah's sentiments, the only thing that makes sense
 to me is that a JSON text field is treated the same way as we treat text.
 Right now, that means NUL is not allowed, period.

 If no one has bitched about this with text, is it really that big a
 problem with JSON?


I would bitch about it for text if I thought that it would do any good.


Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-28 Thread Tom Lane
David G Johnston david.g.johns...@gmail.com writes:
 Am I missing something or has there been no consideration in this forbid
 plan on whether users will be able to retrieve, even if partially
 incorrectly, any jsonb data that has already been stored?

Data that's already been stored would look like the six characters \u0000,
which indeed might have been what it was anyway, since part of the
problem here is that we fail to distinguish \\u0000 from \u0000.

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-28 Thread David G Johnston
Tom Lane wrote:
 Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 02:28 PM, Tom Lane wrote:
 Well, we can either fix it now or suffer with a broken representation
 forever.  I'm not wedded to the exact solution I described, but I think
 we'll regret it if we don't change the representation.
 
 The only other plausible answer seems to be to flat out reject \u0000.
 But I assume nobody likes that.
 
 I don't think we can be in the business of rejecting valid JSON.
 
 Actually, after studying the code a bit, I wonder if we wouldn't be best
 off to do exactly that, at least for 9.4.x.  At minimum we're talking
 about an API change for JsonSemAction functions (which currently get the
 already-de-escaped string as a C string; not gonna work for embedded
 nulls).  I'm not sure if there are already third-party extensions using
 that API, but it seems possible, in which case changing it in a minor
 release wouldn't be nice.  Even ignoring that risk, making sure
 we'd fixed everything seems like more than a day's work, which is as
 much as I for one could spare before 9.4.1.
 
 So at this point I propose that we reject \u0000 when de-escaping JSON.
 Anybody who's seriously unhappy with that can propose a patch to fix it
 properly in 9.5 or later.

The hybrid option is to reject the values for 9.4.1 but then commit to
removing that hack and fixing this properly in 9.4.2; we can always call
that release 9.5...


 I think the "it would mean rejecting valid JSON" argument is utter
 hogwash.  We already reject, eg, "\u00A0" if you're not using a UTF8
 encoding.  And we reject 1e10000, not because that's invalid JSON
 but because of an implementation restriction of our underlying numeric
 type.  I don't see any moral superiority of that over rejecting \u0000
 because of an implementation restriction of our underlying text type.

Am I missing something or has there been no consideration in this forbid
plan on whether users will be able to retrieve, even if partially
incorrectly, any jsonb data that has already been stored?  If we mangled
their data on input we should at least return the data and provide them a
chance to manually (or automatically depending on their data) fix our
mistake.

Given we already disallow NUL in text ISTM that allowing said data in other
text-like areas is asking for just the kind of trouble we are seeing here. 
I'm OK with the proposition that those wishing to utilize NUL are relegated
to working with bytea.

From the commit Tom references down-thread:

However, this led to some perverse results in the case of Unicode
sequences.

Given that said results are not documented in the commit it's hard to judge
whether a complete revert is being overly broad...

Anyway, just some observations since I'm not currently a user of JSON. 
Tom's arguments and counter-arguments ring true to me in the general sense. 
The DBA staying on 9.4.0 because of this change probably just needs to be
told to go back to using json and then run the update.  Their data has
issues even if they stay on 9.4.0 with the more accepting version of jsonb.

David J.








Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 01:40 PM, Tom Lane wrote:
 In particular, I would like to suggest that the current representation of
  \u0000 is fundamentally broken and that we have to change it, not try to
 band-aid around it.  This will mean an on-disk incompatibility for jsonb
  data containing U+0000, but hopefully there is very little of that out
 there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
 consider such solutions.

 Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on 
 the table what I suggested is unnecessary.

Well, we can either fix it now or suffer with a broken representation
forever.  I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.

The only other plausible answer seems to be to flat out reject \u0000.
But I assume nobody likes that.

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Andrew Dunstan


On 01/27/2015 02:28 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 01/27/2015 01:40 PM, Tom Lane wrote:

In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.

Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on
the table what I suggested is unnecessary.

Well, we can either fix it now or suffer with a broken representation
forever.  I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.

The only other plausible answer seems to be to flat out reject \u0000.
But I assume nobody likes that.




I don't think we can be in the business of rejecting valid JSON.

cheers

andrew




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 02:28 PM, Tom Lane wrote:
 Well, we can either fix it now or suffer with a broken representation
 forever.  I'm not wedded to the exact solution I described, but I think
 we'll regret it if we don't change the representation.
 
  The only other plausible answer seems to be to flat out reject \u0000.
 But I assume nobody likes that.

 I don't think we can be in the business of rejecting valid JSON.

Actually, after studying the code a bit, I wonder if we wouldn't be best
off to do exactly that, at least for 9.4.x.  At minimum we're talking
about an API change for JsonSemAction functions (which currently get the
already-de-escaped string as a C string; not gonna work for embedded
nulls).  I'm not sure if there are already third-party extensions using
that API, but it seems possible, in which case changing it in a minor
release wouldn't be nice.  Even ignoring that risk, making sure
we'd fixed everything seems like more than a day's work, which is as
much as I for one could spare before 9.4.1.

Also, while the idea of throwing error only when a \0 needs to be
converted to text seems logically clean, it looks like that might pretty
much cripple the usability of such values anyhow, because we convert to
text at the drop of a hat.  So some investigation and probably additional
work would be needed to ensure you could do at least basic things with
such values.  (A function for direct conversion to bytea might be useful
too.)

I think the "it would mean rejecting valid JSON" argument is utter
hogwash.  We already reject, eg, "\u00A0" if you're not using a UTF8
encoding.  And we reject 1e10000, not because that's invalid JSON
but because of an implementation restriction of our underlying numeric
type.  I don't see any moral superiority of that over rejecting \u0000
because of an implementation restriction of our underlying text type.

So at this point I propose that we reject \u0000 when de-escaping JSON.
Anybody who's seriously unhappy with that can propose a patch to fix it
properly in 9.5 or later.

We probably need to rethink the re-escaping behavior as well; I'm not
sure if your latest patch is the right answer for that.

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 In particular, I would like to suggest that the current representation of
  \u0000 is fundamentally broken and that we have to change it, not try to
 band-aid around it.  This will mean an on-disk incompatibility for jsonb
  data containing U+0000, but hopefully there is very little of that out
 there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
 consider such solutions.
 
 The most obvious way to store such data unambiguously is to just go ahead
  and store U+0000 as a NUL byte (\000).  The only problem with that is that
 then such a string cannot be considered to be a valid value of type TEXT,
 which would mean that we'd need to throw an error if we were asked to
 convert a JSON field containing such a character to text.

 Hm, does this include text out operations for display purposes?   If
 so, then any query selecting jsonb objects with null bytes would fail.
 How come we have to error out?  How about a warning indicating the
 string was truncated?

No, that's not a problem, because jsonb_out would produce an escaped
textual representation, so any null would come out as \u0000 again.
The trouble comes up when you do something that's supposed to produce
a *non escaped* text equivalent of a JSON string value, such as
the ->> operator.

Arguably, ->> is broken already with the current coding, in that
these results are entirely inconsistent:

regression=# select '{"a":"foo\u0040bar"}'::jsonb ->> 'a';
 ?column? 
----------
 foo@bar
(1 row)

regression=# select '{"a":"foo\u0000bar"}'::jsonb ->> 'a';
   ?column?   
--------------
 foo\u0000bar
(1 row)

regression=# select '{"a":"foo\\u0000bar"}'::jsonb ->> 'a';
   ?column?   
--------------
 foo\u0000bar
(1 row)

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 12:23 PM, Tom Lane wrote:
 I think coding anything is premature until we decide how we're going to
 deal with the fundamental ambiguity.

 The input "\\uabcd" will be stored correctly as \uabcd, but this will in 
 turn be rendered as "\uabcd", whereas it should be rendered as "\\uabcd". 
 That's what the patch fixes.
 There are two problems here and this addresses one of them. The other 
 problem is the ambiguity regarding \\u0000 and \u0000.

It's the same problem really, and until we have an answer about
what to do with \u, I think any patch is half-baked and possibly
counterproductive.

In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.

The most obvious way to store such data unambiguously is to just go ahead
and store U+0000 as a NUL byte (\000).  The only problem with that is that
then such a string cannot be considered to be a valid value of type TEXT,
which would mean that we'd need to throw an error if we were asked to
convert a JSON field containing such a character to text.  I don't
particularly have a problem with that, except possibly for the time cost
of checking for \000 before allowing a conversion to occur.  While a
memchr() check might be cheap enough, we could also consider inventing a
new JEntry type code for string-containing-null, so that there's a
distinction in the type system between strings that are coercible to text
and those that are not.
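For illustration, a small Python sketch (using Python's json module as a stand-in; this is not PostgreSQL code) of why de-escaping \u0000 to a real NUL byte removes the ambiguity:

```python
import json

# Sketch using Python's json module as a stand-in (not PostgreSQL code):
# if de-escaping turns \u0000 into a real NUL byte, the two previously
# ambiguous inputs are stored differently, so output can re-escape each
# one correctly and the round trip is lossless.
stored_a = json.loads(r'"\u0000"')    # the escape -> one NUL character
stored_b = json.loads(r'"\\u0000"')   # escaped backslash -> the six chars \u0000

assert stored_a == "\x00"
assert stored_b == "\\u0000"
assert stored_a != stored_b           # unambiguous storage

# Output re-escapes each stored form distinctly:
assert json.dumps(stored_a) == r'"\u0000"'
assert json.dumps(stored_b) == r'"\\u0000"'
```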

If we went down a path like that, the currently proposed patch would be
quite useless.

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Andrew Dunstan


On 01/27/2015 01:40 PM, Tom Lane wrote:


In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.




Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on 
the table what I suggested is unnecessary.


cheers

andrew




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Merlin Moncure
On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 12:23 PM, Tom Lane wrote:
 I think coding anything is premature until we decide how we're going to
 deal with the fundamental ambiguity.

  The input "\\uabcd" will be stored correctly as \uabcd, but this will in
  turn be rendered as "\uabcd", whereas it should be rendered as "\\uabcd".
 That's what the patch fixes.
 There are two problems here and this addresses one of them. The other
  problem is the ambiguity regarding \\u0000 and \u0000.

 It's the same problem really, and until we have an answer about
  what to do with \u0000, I think any patch is half-baked and possibly
 counterproductive.

 In particular, I would like to suggest that the current representation of
  \u0000 is fundamentally broken and that we have to change it, not try to
 band-aid around it.  This will mean an on-disk incompatibility for jsonb
  data containing U+0000, but hopefully there is very little of that out
 there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
 consider such solutions.

 The most obvious way to store such data unambiguously is to just go ahead
  and store U+0000 as a NUL byte (\000).  The only problem with that is that
 then such a string cannot be considered to be a valid value of type TEXT,
 which would mean that we'd need to throw an error if we were asked to
 convert a JSON field containing such a character to text.

Hm, does this include text out operations for display purposes?   If
so, then any query selecting jsonb objects with null bytes would fail.
How come we have to error out?  How about a warning indicating the
string was truncated?

merlin




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Noah Misch
On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
 Andrew Dunstan and...@dunslane.net writes:
  On 01/27/2015 02:28 PM, Tom Lane wrote:
  Well, we can either fix it now or suffer with a broken representation
  forever.  I'm not wedded to the exact solution I described, but I think
  we'll regret it if we don't change the representation.

 So at this point I propose that we reject \u0000 when de-escaping JSON.

I would have agreed on 2014-12-09, and this release is the last chance to make
such a change.  It is a bold wager that could pay off, but -1 from me anyway.
I can already envision the blog post from the DBA staying on 9.4.0 because
9.4.1 pulled his ability to store U+0000 in jsonb.  jsonb was *the* top-billed
9.4 feature, and this thread started with Andrew conveying a field report of a
scenario more obscure than storing U+0000.  Therefore, we have to assume many
users will notice the change.  This move would also add to the growing
evidence that our .0 releases are really beta(N+1) releases in disguise.

 Anybody who's seriously unhappy with that can propose a patch to fix it
 properly in 9.5 or later.

Someone can still do that by introducing a V2 of the jsonb binary format and
preserving the ability to read both formats.  (Too bad Andres's proposal to
include a format version didn't inform the final format, but we can wing it.)
I agree that storing U+0000 as 0x00 is the best end state.

 We probably need to rethink the re-escaping behavior as well; I'm not
 sure if your latest patch is the right answer for that.

Yes, we do.  No change to the representation of U+0000 is going to fix the
following bug, but that patch does fix it:

[local] test=# select
test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb,
test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb::text::jsonb;
 ?column? | ?column? 
----------+----------
 t        | f
(1 row)




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Andrew Dunstan


On 01/27/2015 12:24 AM, Noah Misch wrote:

For the moment, maybe I could commit the fix for the non-U+0000 case for
escape_json, and we could think some more about detecting and warning about
the problem strings.

+1 for splitting development that way.  Fixing the use of escape_json() is
objective, but the choices around the warning are more subtle.




OK, so something like this patch? I'm mildly concerned that you and I 
are the only ones commenting on this.


cheers

andrew


diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 3c137ea..8d2b38f 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -94,6 +94,8 @@ static void datum_to_json(Datum val, bool is_null, StringInfo result,
 static void add_json(Datum val, bool is_null, StringInfo result,
 		 Oid val_type, bool key_scalar);
 static text *catenate_stringinfo_string(StringInfo buffer, const char *addon);
+static void  escape_json_work(StringInfo buf, const char *str,
+			  bool jsonb_unicode);
 
 /* the null action object used for pure validation */
 static JsonSemAction nullSemAction =
@@ -2356,6 +2358,18 @@ json_object_two_arg(PG_FUNCTION_ARGS)
 void
 escape_json(StringInfo buf, const char *str)
 {
+	escape_json_work( buf, str, false);
+}
+
+void
+escape_jsonb(StringInfo buf, const char *str)
+{
+	escape_json_work( buf, str, true);
+}
+
+static void
+escape_json_work(StringInfo buf, const char *str, bool jsonb_unicode)
+{
 	const char *p;
 
 	appendStringInfoCharMacro(buf, '\"');
@@ -2398,7 +2412,9 @@ escape_json(StringInfo buf, const char *str)
  * unicode escape that should be present is \u0000, all the
  * other unicode escapes will have been resolved.
  */
-				if (p[1] == 'u' &&
+				if (jsonb_unicode && strncmp(p+1, "u0000", 5) == 0)
+					appendStringInfoCharMacro(buf, *p);
+				else if (!jsonb_unicode && p[1] == 'u' &&
 					isxdigit((unsigned char) p[2]) &&
 					isxdigit((unsigned char) p[3]) &&
 					isxdigit((unsigned char) p[4]) &&
diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index 644ea6d..22e1263 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -305,7 +305,7 @@ jsonb_put_escaped_value(StringInfo out, JsonbValue *scalarVal)
 			appendBinaryStringInfo(out, null, 4);
 			break;
 		case jbvString:
-			escape_json(out, pnstrdup(scalarVal->val.string.val, scalarVal->val.string.len));
+			escape_jsonb(out, pnstrdup(scalarVal->val.string.val, scalarVal->val.string.len));
 			break;
 		case jbvNumeric:
 			appendStringInfoString(out,
diff --git a/src/include/utils/json.h b/src/include/utils/json.h
index 6f69403..661d7bd 100644
--- a/src/include/utils/json.h
+++ b/src/include/utils/json.h
@@ -42,7 +42,13 @@ extern Datum json_build_array_noargs(PG_FUNCTION_ARGS);
 extern Datum json_object(PG_FUNCTION_ARGS);
 extern Datum json_object_two_arg(PG_FUNCTION_ARGS);
 
+/*
+ * escape_jsonb is more strict about unicode escapes, and will only not
+ * escape \u0000, as that is the only unicode escape that should still be
+ * present.
+ */
 extern void escape_json(StringInfo buf, const char *str);
+extern void escape_jsonb(StringInfo buf, const char *str);
 
 extern Datum json_typeof(PG_FUNCTION_ARGS);
 
diff --git a/src/test/regress/expected/jsonb.out b/src/test/regress/expected/jsonb.out
index aa5686f..b5457c4 100644
--- a/src/test/regress/expected/jsonb.out
+++ b/src/test/regress/expected/jsonb.out
@@ -324,11 +324,20 @@ select to_jsonb(timestamptz '2014-05-28 12:22:35.614298-04');
 (1 row)
 
 COMMIT;
--- unicode escape - backslash is not escaped
+-- unicode null - backslash not escaped
+-- note: using to_jsonb here bypasses the jsonb input routine.
+select jsonb '"\u0000"', to_jsonb(text '\u0000');
+  jsonb   | to_jsonb 
+----------+----------
+ "\u0000" | "\u0000"
+(1 row)
+
+-- any other unicode-looking escape - backslash is escaped
+-- all unicode characters should have been resolved
 select to_jsonb(text '\uabcd');
- to_jsonb 
-----------
- "\uabcd"
+ to_jsonb  
+-----------
+ "\\uabcd"
 (1 row)
 
 -- any other backslash is escaped
@@ -338,6 +347,14 @@ select to_jsonb(text '\abcd');
  "\\abcd"
 (1 row)
 
+-- doubled backslash should be reproduced
+-- this isn't a unicode escape, it's an escaped backslash followed by 'u'
+select jsonb '"\\uabcd"';
+   jsonb   
+-----------
+ "\\uabcd"
+(1 row)
+
 --jsonb_agg
 CREATE TEMP TABLE rows AS
 SELECT x, 'txt' || x as y
diff --git a/src/test/regress/expected/jsonb_1.out b/src/test/regress/expected/jsonb_1.out
index 687ae63..10aaac8 100644
--- a/src/test/regress/expected/jsonb_1.out
+++ b/src/test/regress/expected/jsonb_1.out
@@ -324,11 +324,20 @@ select to_jsonb(timestamptz '2014-05-28 12:22:35.614298-04');
 (1 row)
 
 COMMIT;
--- unicode escape - backslash is not escaped
+-- unicode null - backslash not escaped
+-- note: using to_jsonb here bypasses the jsonb input routine.
+select jsonb '"\u0000"', to_jsonb(text '\u0000');
+  jsonb   | to_jsonb 
+----------+----------
+ "\u0000" | "\u0000"
+(1 row)
+
+-- any 

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 12:24 AM, Noah Misch wrote:
 +1 for splitting development that way.  Fixing the use of escape_json() is
 objective, but the choices around the warning are more subtle.

 OK, so something like this patch? I'm mildly concerned that you and I 
 are the only ones commenting on this.

Doesn't seem to me like this fixes anything.  If the content of a jsonb
value is correct, the output will be the same with or without this patch;
and if it's not, this isn't going to really improve matters.

I think coding anything is premature until we decide how we're going to
deal with the fundamental ambiguity.

regards, tom lane




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Andrew Dunstan


On 01/27/2015 12:23 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 01/27/2015 12:24 AM, Noah Misch wrote:

+1 for splitting development that way.  Fixing the use of escape_json() is
objective, but the choices around the warning are more subtle.

OK, so something like this patch? I'm mildly concerned that you and I
are the only ones commenting on this.

Doesn't seem to me like this fixes anything.  If the content of a jsonb
value is correct, the output will be the same with or without this patch;
and if it's not, this isn't going to really improve matters.

I think coding anything is premature until we decide how we're going to
deal with the fundamental ambiguity.





The input "\\uabcd" will be stored correctly as \uabcd, but this will in
turn be rendered as "\uabcd", whereas it should be rendered as "\\uabcd".
That's what the patch fixes.
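A simplified Python model of the output-side behavior the patch aims for (hypothetical helper name; the real change is in C, splitting escape_json into json and jsonb variants): when rendering a stored jsonb string, only a literal \u0000 sequence passes through unescaped, and every other backslash is doubled.

```python
# Sketch only: models the jsonb output escaping under discussion, not the
# actual PostgreSQL C code.  The stored text may contain a literal six-char
# \u0000 sequence (jsonb's stand-in for a unicode null); everything else
# with a backslash must be re-escaped on output.
def escape_jsonb_string(stored: str) -> str:
    out = ['"']
    i = 0
    while i < len(stored):
        if stored.startswith("\\u0000", i):
            out.append("\\u0000")   # the one unresolved escape: pass through
            i += 6
        elif stored[i] == "\\":
            out.append("\\\\")      # any other backslash: escape it
            i += 1
        elif stored[i] == '"':
            out.append('\\"')
            i += 1
        else:
            out.append(stored[i])
            i += 1
    out.append('"')
    return "".join(out)

# Stored text \uabcd (from input "\\uabcd") now renders with its backslash
# escaped, instead of being mistaken for a unicode escape:
assert escape_jsonb_string("\\uabcd") == '"\\\\uabcd"'
assert escape_jsonb_string("\\u0000") == '"\\u0000"'
```

This mirrors the escape_json/escape_jsonb split in the patch: the json variant keeps passing any \uXXXX through, while the jsonb variant passes through only \u0000.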


There are two problems here and this addresses one of them. The other 
problem is the ambiguity regarding \\u and \u.


cheers

andrew





Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-26 Thread Andrew Dunstan


On 01/23/2015 02:18 AM, Noah Misch wrote:

On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote:

The following case has just been brought to my attention (look at the
differing number of backslashes):

andrew=# select jsonb '"\\u0000"';
   jsonb
----------
 "\u0000"
(1 row)

andrew=# select jsonb '"\u0000"';
   jsonb
----------
 "\u0000"
(1 row)

A mess indeed.  The input is unambiguous, but the jsonb representation can't
distinguish "\u0000" from "\\u0000".  Some operations on the original json
type have similar problems, since they use an in-memory binary representation
with the same shortcoming:

[local] test=# select json_array_element_text($$["\u0000"]$$, 0) =
test-# json_array_element_text($$["\\u0000"]$$, 0);
  ?column?
--
  t
(1 row)


Things get worse, though. On output, '\uabcd' for any four hex digits is
recognized as a unicode escape, and thus the backslash is not escaped, so
that we get:

andrew=# select jsonb '"\\uabcd"';
   jsonb
----------
 "\uabcd"
(1 row)


We could probably fix this fairly easily for non-U+0000 cases by having
jsonb_to_cstring use a different escape_json routine.

Sounds reasonable.  For 9.4.1, before more people upgrade?


But it's a mess, sadly, and I'm not sure what a good fix for the U+0000 case
would look like.

Agreed.  When a string unescape algorithm removes some kinds of backslash
escapes and not others, it's nigh inevitable that two semantically-distinct
inputs can yield the same output.  json_lex_string() fell into that trap by
making an exception for \u0000.  To fix this, the result needs to be fully
unescaped (\u0000 converted to the NUL byte) or retain all backslash escapes.
(Changing that either way is no fun now that an on-disk format is at stake.)
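Noah's observation, that an unescape algorithm which resolves some backslash escapes but not others must map two distinct inputs to one output, can be illustrated with a short Python sketch (a hypothetical toy model, not the actual C lexer):

```python
# A toy model (NOT the real json_lex_string in C) of an unescape pass that
# resolves every JSON escape EXCEPT \u0000, which it keeps as the literal
# six characters "\u0000" because the stored text cannot hold a NUL byte.
def partial_unescape(s: str) -> str:
    out = []
    i = 0
    while i < len(s):
        if s[i] == "\\":
            nxt = s[i + 1]
            if nxt == "u":
                hex4 = s[i + 2:i + 6]
                if hex4 == "0000":
                    out.append("\\u0000")           # the one exception: kept verbatim
                else:
                    out.append(chr(int(hex4, 16)))  # all other \uXXXX resolved
                i += 6
            else:
                # \\, \", \n, \t, ... resolve to the escaped character
                out.append({"n": "\n", "t": "\t"}.get(nxt, nxt))
                i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

# Two semantically distinct inputs collapse to the same stored text:
assert partial_unescape(r"\u0000") == partial_unescape(r"\\u0000") == "\\u0000"
```

The input `\u0000` (a unicode escape) and `\\u0000` (a literal backslash followed by u0000) both come out as the six characters \u0000, which is exactly the ambiguity at stake.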


Maybe we should detect such input and emit a warning of
ambiguity? It's likely to be rare enough, but clearly not as rare as we'd
like, since this is a report from the field.

Perhaps.  Something like WARNING:  jsonb cannot represent \\u0000; reading as
\u0000?  Alas, but I do prefer that to silent data corruption.





Maybe something like this patch. I have two concerns about it, though.
The first is the possible performance impact of looking for the
offending string in every jsonb input, and the second is that the test
isn't quite right, since input of \\\u0000 doesn't raise this issue -
i.e. the problem arises when u0000 is preceded by an even number of
backslashes.
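The even/odd rule can be sketched as a Python check (hypothetical helper name; the posted patch just uses a plain strstr() and therefore over-matches):

```python
import re

# Hypothetical detector for the ambiguous inputs described above: raw JSON
# text is problematic only when "u0000" is preceded by an EVEN number of
# backslashes, so that the final backslash is a literal one (produced by a
# \\ escape) rather than the start of a real \u0000 escape.
def has_ambiguous_u0000(raw_json: str) -> bool:
    for m in re.finditer(r"(\\+)u0000", raw_json):
        if len(m.group(1)) % 2 == 0:
            return True
    return False

assert not has_ambiguous_u0000(r'"\u0000"')    # real \u0000 escape: fine
assert has_ambiguous_u0000(r'"\\u0000"')       # literal \ then u0000: ambiguous
assert not has_ambiguous_u0000(r'"\\\u0000"')  # literal \ then \u0000 escape: fine
assert has_ambiguous_u0000(r'"\\\\u0000"')     # two literal backslashes + u0000
```

Because the regex run `\\+` is greedy and finditer scans left to right, each maximal backslash run is counted exactly once, which is what makes the parity test meaningful.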


For the moment, maybe I could commit the fix for the non-U+0000 case for 
escape_json, and we could think some more about detecting and warning 
about the problem strings.


cheers

andrew
diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 3c137ea..8d2b38f 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -94,6 +94,8 @@ static void datum_to_json(Datum val, bool is_null, StringInfo result,
 static void add_json(Datum val, bool is_null, StringInfo result,
 		 Oid val_type, bool key_scalar);
 static text *catenate_stringinfo_string(StringInfo buffer, const char *addon);
+static void  escape_json_work(StringInfo buf, const char *str,
+			  bool jsonb_unicode);
 
 /* the null action object used for pure validation */
 static JsonSemAction nullSemAction =
@@ -2356,6 +2358,18 @@ json_object_two_arg(PG_FUNCTION_ARGS)
 void
 escape_json(StringInfo buf, const char *str)
 {
+	escape_json_work( buf, str, false);
+}
+
+void
+escape_jsonb(StringInfo buf, const char *str)
+{
+	escape_json_work( buf, str, true);
+}
+
+static void
+escape_json_work(StringInfo buf, const char *str, bool jsonb_unicode)
+{
 	const char *p;
 
 	appendStringInfoCharMacro(buf, '\"');
@@ -2398,7 +2412,9 @@ escape_json(StringInfo buf, const char *str)
  * unicode escape that should be present is \u0000, all the
  * other unicode escapes will have been resolved.
  */
-				if (p[1] == 'u' &&
+				if (jsonb_unicode && strncmp(p+1, "u0000", 5) == 0)
+					appendStringInfoCharMacro(buf, *p);
+				else if (!jsonb_unicode && p[1] == 'u' &&
 					isxdigit((unsigned char) p[2]) &&
 					isxdigit((unsigned char) p[3]) &&
 					isxdigit((unsigned char) p[4]) &&
diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index 644ea6d..ba1e7f0 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -218,6 +218,14 @@ jsonb_from_cstring(char *json, int len)
 	JsonbInState state;
 	JsonSemAction sem;
 
+	if (strstr(json, "\\u0000") != NULL)
+		ereport(WARNING,
+				(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
+				 errmsg("jsonb cannot represent \\\\u0000; reading as "
+						"\\u0000"),
+				 errdetail("Due to an implementation restriction, jsonb cannot "
+						   "represent a unicode null immediately preceded by a backslash.")));
+
 	memset(&state, 0, sizeof(state));
 	memset(&sem, 0, sizeof(sem));
 	lex = makeJsonLexContextCstringLen(json, len, true);
@@ -305,7 +313,7 @@ jsonb_put_escaped_value(StringInfo out, JsonbValue *scalarVal)
 			

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-26 Thread Noah Misch
On Mon, Jan 26, 2015 at 09:20:54AM -0500, Andrew Dunstan wrote:
 On 01/23/2015 02:18 AM, Noah Misch wrote:
 On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote:
 We could probably fix this fairly easily for non-U+0000 cases by having
 jsonb_to_cstring use a different escape_json routine.

 Maybe we should detect such input and emit a warning of
 ambiguity? It's likely to be rare enough, but clearly not as rare as we'd
 like, since this is a report from the field.

 Perhaps.  Something like WARNING:  jsonb cannot represent \\u0000; reading as
 \u0000?  Alas, but I do prefer that to silent data corruption.
 
 Maybe something like this patch. I have two concerns about it, though. The
 first is the possible performance impact of looking for the offending string
 in every jsonb input, and the second is that the test isn't quite right,
 since input of \\\u0000 doesn't raise this issue - i.e. the problem arises
 when u0000 is preceded by an even number of backslashes.

I share your second concern.  It is important that this warning not cry wolf;
it should never fire for an input that is actually unaffected.

 For the moment, maybe I could commit the fix for the non-U+0000 case for
 escape_json, and we could think some more about detecting and warning about
 the problem strings.

+1 for splitting development that way.  Fixing the use of escape_json() is
objective, but the choices around the warning are more subtle.




Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-22 Thread Noah Misch
On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote:
 The following case has just been brought to my attention (look at the
 differing number of backslashes):
 
   andrew=# select jsonb '"\\u0000"';
      jsonb
   ----------
    "\u0000"
   (1 row)
 
   andrew=# select jsonb '"\u0000"';
      jsonb
   ----------
    "\u0000"
   (1 row)

A mess indeed.  The input is unambiguous, but the jsonb representation can't
distinguish "\u0000" from "\\u0000".  Some operations on the original json
type have similar problems, since they use an in-memory binary representation
with the same shortcoming:

[local] test=# select json_array_element_text($$["\u0000"]$$, 0) =
test-# json_array_element_text($$["\\u0000"]$$, 0);
 ?column? 
--
 t
(1 row)

 Things get worse, though. On output, '\uabcd' for any four hex digits is
 recognized as a unicode escape, and thus the backslash is not escaped, so
 that we get:
 
   andrew=# select jsonb '"\\uabcd"';
      jsonb
   ----------
    "\uabcd"
   (1 row)
 
 
 We could probably fix this fairly easily for non-U+0000 cases by having
 jsonb_to_cstring use a different escape_json routine.

Sounds reasonable.  For 9.4.1, before more people upgrade?

 But it's a mess, sadly, and I'm not sure what a good fix for the U+0000 case
 would look like.

Agreed.  When a string unescape algorithm removes some kinds of backslash
escapes and not others, it's nigh inevitable that two semantically-distinct
inputs can yield the same output.  json_lex_string() fell into that trap by
making an exception for \u0000.  To fix this, the result needs to be fully
unescaped (\u0000 converted to the NUL byte) or retain all backslash escapes.
(Changing that either way is no fun now that an on-disk format is at stake.)

 Maybe we should detect such input and emit a warning of
 ambiguity? It's likely to be rare enough, but clearly not as rare as we'd
 like, since this is a report from the field.

Perhaps.  Something like WARNING:  jsonb cannot represent \\u0000; reading as
\u0000?  Alas, but I do prefer that to silent data corruption.

Thanks,
nm

