subject:"\[HACKERS\] jsonb, unicode escapes and escaped backslashes"

Robert Haas robertmh...@gmail.com writes:
 On Thu, Jan 29, 2015 at 4:33 PM, Andrew Dunstan and...@dunslane.net wrote:
 I'm coming down more and more on the side of Tom's suggestion just to ban
 \u in jsonb.

 I have yet to understand what we fix by banning \u.  How is 
 different from any other four-digit hexadecimal number that's not a
 valid character in the current encoding?  What does banning that one
 particular value do?

As Andrew pointed out upthread, it avoids having to answer the question of
what to return for

select (jsonb '[foo\ubar]')-0;

or any other construct which is supposed to return an *unescaped* text
representation of some JSON string value.

Right now you get

   ?column?   
--
 foo\ubar
(1 row)

Which is wrong IMO, first because it violates the premise that the output
should be unescaped, and second because this output cannot be
distinguished from the (correct) output of

regression=# select (jsonb '[foo\\ubar]')-0;
   ?column?   
--
 foo\ubar
(1 row)

There is no way to deliver an output that is not confusable with some
other value's correct output, other than by emitting a genuine \0 byte
which unfortunately we cannot support in a TEXT result.

Potential solutions for this have been mooted upthread, but none
of them look like they're something we can do in the very short run.
So the proposal is to ban \u until such time as we can do something
sane with it.

 In any case, whatever we do about that issue, the idea that the text
 - json string transformation can *change the input string into some
 other string* seems like an independent problem.

No, it's exactly the same problem, because the reason for that breakage
is an ill-advised attempt to make it safe to include \u in JSONB.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

Robert Haas robertmh...@gmail.com writes:
 I have yet to understand what we fix by banning \u.  How is 
 different from any other four-digit hexadecimal number that's not a
 valid character in the current encoding?  What does banning that one
 particular value do?

BTW, as to the point about encoding violations: we *already* ban \u
sequences that don't correspond to valid characters in the current
encoding.  The attempt to exclude U+ from the set of banned characters
was ill-advised, plain and simple.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

Attached is a draft patch for this.  It basically reverts commit
0ad1a816320a2b539a51628e2a0b1e83ff096b1d, adds a ban of \u if
that would need to be converted to text (so it still works in the
plain json type, so long as you don't do much processing), and adds
some regression tests.

I made the \u error be errcode(ERRCODE_INVALID_TEXT_REPRESENTATION)
and errmsg(invalid input syntax for type json), by analogy to what's
thrown for non-ASCII Unicode escapes in non-UTF8 encoding.  I'm not
terribly happy with that, though.  ISTM that for both cases, this is
not invalid syntax at all, but an implementation restriction that
forces us to reject perfectly valid syntax.  So I think we ought to
use a different ERRCODE and text message, though I'm not entirely
sure what it should be instead.  ERRCODE_FEATURE_NOT_SUPPORTED is
one possibility.

regards, tom lane

diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml
index 8feb2fb..b4b97a7 100644
*** a/doc/src/sgml/json.sgml
--- b/doc/src/sgml/json.sgml
***
*** 69,80 
regardless of the database encoding, and are checked only for syntactic
correctness (that is, that four hex digits follow literal\u/).
However, the input function for typejsonb/ is stricter: it disallows
!   Unicode escapes for non-ASCII characters (those
!   above literalU+007F/) unless the database encoding is UTF8.  It also
!   insists that any use of Unicode surrogate pairs to designate characters
!   outside the Unicode Basic Multilingual Plane be correct.  Valid Unicode
!   escapes, except for literal\u/, are then converted to the
!   equivalent ASCII or UTF8 character for storage.
   /para
  
   note
--- 69,82 
regardless of the database encoding, and are checked only for syntactic
correctness (that is, that four hex digits follow literal\u/).
However, the input function for typejsonb/ is stricter: it disallows
!   Unicode escapes for non-ASCII characters (those above literalU+007F/)
!   unless the database encoding is UTF8.  The typejsonb/ type also
!   rejects literal\u/ (because that cannot be represented in
!   productnamePostgreSQL/productname's typetext/ type), and it insists
!   that any use of Unicode surrogate pairs to designate characters outside
!   the Unicode Basic Multilingual Plane be correct.  Valid Unicode escapes
!   are converted to the equivalent ASCII or UTF8 character for storage;
!   this includes folding surrogate pairs into a single character.
   /para
  
   note
***
*** 134,140 
 row
  entrytypestring//entry
  entrytypetext//entry
! entrySee notes above concerning encoding restrictions/entry
 /row
 row
  entrytypenumber//entry
--- 136,143 
 row
  entrytypestring//entry
  entrytypetext//entry
! entryliteral\u/ is disallowed, as are non-ASCII Unicode
!  escapes if database encoding is not UTF8/entry
 /row
 row
  entrytypenumber//entry
diff --git a/doc/src/sgml/release-9.4.sgml b/doc/src/sgml/release-9.4.sgml
index 961e461..11bbf3b 100644
*** a/doc/src/sgml/release-9.4.sgml
--- b/doc/src/sgml/release-9.4.sgml
***
*** 103,124 
  
  listitem
   para
-   Unicode escapes in link linkend=datatype-jsontypeJSON/type/link
-   text values are no longer rendered with the backslash escaped
-   (Andrew Dunstan)
-  /para
- 
-  para
-   Previously, all backslashes in text values being formed into JSON
-   were escaped. Now a backslash followed by literalu/ and four
-   hexadecimal digits is not escaped, as this is a legal sequence in a
-   JSON string value, and escaping the backslash led to some perverse
-   results.
-  /para
- /listitem
- 
- listitem
-  para
When converting values of type typedate/, typetimestamp/
or typetimestamptz/
to link linkend=datatype-jsontypeJSON/type/link, render the
--- 103,108 
diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 3c137ea..4e46b0a 100644
*** a/src/backend/utils/adt/json.c
--- b/src/backend/utils/adt/json.c
*** json_lex_string(JsonLexContext *lex)
*** 806,819 
  	 * For UTF8, replace the escape sequence by the actual
  	 * utf8 character in lex-strval. Do this also for other
  	 * encodings if the escape designates an ASCII character,
! 	 * otherwise raise an error. We don't ever unescape a
! 	 * \u, since that would result in an impermissible nul
! 	 * byte.
  	 */
  
  	if (ch == 0)
  	{
! 		appendStringInfoString(lex-strval, \\u);
  	}
  	else if (GetDatabaseEncoding() == PG_UTF8)
  	{
--- 806,822 
  	 * For UTF8, replace the escape sequence by the actual
  	 * utf8 character in lex-strval. Do this also for other
  	 * encodings if the escape designates an ASCII character,
! 	 *

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Peter Geoghegan

On Thu, Jan 29, 2015 at 10:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I made the \u error be errcode(ERRCODE_INVALID_TEXT_REPRESENTATION)
 and errmsg(invalid input syntax for type json), by analogy to what's
 thrown for non-ASCII Unicode escapes in non-UTF8 encoding.  I'm not
 terribly happy with that, though.  ISTM that for both cases, this is
 not invalid syntax at all, but an implementation restriction that
 forces us to reject perfectly valid syntax.  So I think we ought to
 use a different ERRCODE and text message, though I'm not entirely
 sure what it should be instead.  ERRCODE_FEATURE_NOT_SUPPORTED is
 one possibility.

I personally prefer what you have here.

The point of JSONB is that we take a position on certain aspects like
this. We're bridging a pointedly loosey goosey interchange format,
JSON, with native PostgreSQL types. For example, we take a firm
position on encoding. The JSON type is a bit more permissive, to about
the extent that that's possible. The whole point is that we're
interpreting JSON data in a way that's consistent with *Postgres*
conventions. You'd have to interpret the data according to *some*
convention in order to do something non-trivial with it in any case,
and users usually want that.

It's really nice the way encoding is a strict implementation detail
within Postgres in general, in the sense that you know that if your
application code is concerned about encoding, you're probably thinking
about the problem incorrectly (at least once data has crossed the
database encoding border). MySQL's laxidasical attitudes here appear
to have been an enormous mistake.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

Peter Geoghegan p...@heroku.com writes:
 I looked into it, and it turns out that MongoDB does not accept NUL in
 at least some contexts (for object keys). Apparently it wasn't always
 so. MongoDB previously had a security issue that was fixed by
 introducing this restriction. Their JSON-centric equivalent of
 per-column privileges was for a time compromised, because NUL
 injection was possible:

 https://www.idontplaydarts.com/2011/02/mongodb-null-byte-injection-attacks/

 It's easy to bash MongoDB, but this is still an interesting data
 point. They changed this after the fact, and yet I can find no
 evidence of any grumbling about it from end users. No one really
 noticed.

Hoo, that's interesting.  Lends some support to my half-baked idea that
we might disallow NUL in object keys even if we are able to allow it
elsewhere in JSON strings.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-29 Thread Andrew Dunstan



On 01/29/2015 05:39 PM, Tom Lane wrote:

Robert Haas robertmh...@gmail.com writes:

I have yet to understand what we fix by banning \u.  How is 
different from any other four-digit hexadecimal number that's not a
valid character in the current encoding?  What does banning that one
particular value do?

BTW, as to the point about encoding violations: we *already* ban \u
sequences that don't correspond to valid characters in the current
encoding.  The attempt to exclude U+ from the set of banned characters
was ill-advised, plain and simple.




Actually, unless the encoding is utf8 we ban all non-ascii unicode 
escapes even if they might designate a valid character in the current 
encoding. This was arrived at after some discussion here. So adding 
\u to the list of banned characters is arguably just making us more 
consistent.



cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

Peter Geoghegan p...@heroku.com writes:
 On Thu, Jan 29, 2015 at 10:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I made the \u error be errcode(ERRCODE_INVALID_TEXT_REPRESENTATION)
 and errmsg(invalid input syntax for type json), by analogy to what's
 thrown for non-ASCII Unicode escapes in non-UTF8 encoding.  I'm not
 terribly happy with that, though.  ISTM that for both cases, this is
 not invalid syntax at all, but an implementation restriction that
 forces us to reject perfectly valid syntax.  So I think we ought to
 use a different ERRCODE and text message, though I'm not entirely
 sure what it should be instead.  ERRCODE_FEATURE_NOT_SUPPORTED is
 one possibility.

 I personally prefer what you have here.

 The point of JSONB is that we take a position on certain aspects like
 this. We're bridging a pointedly loosey goosey interchange format,
 JSON, with native PostgreSQL types. For example, we take a firm
 position on encoding. The JSON type is a bit more permissive, to about
 the extent that that's possible. The whole point is that we're
 interpreting JSON data in a way that's consistent with *Postgres*
 conventions. You'd have to interpret the data according to *some*
 convention in order to do something non-trivial with it in any case,
 and users usually want that.

I quite agree with you, actually, in terms of that perspective.  But my
point remains: \u is not invalid JSON syntax, and neither is
\u1234.  If we choose to throw an error because we can't interpret or
process that according to our conventions, fine, but we should call it
something other than invalid syntax.

ERRCODE_UNTRANSLATABLE_CHARACTER or ERRCODE_CHARACTER_NOT_IN_REPERTOIRE
seem more apropos from here.  And I still think there's a case to be
made for ERRCODE_FEATURE_NOT_SUPPORTED, because it's at least possible
that we'd relax this restriction in future (eg, allow Unicode characters
that can be converted to the database's encoding).

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

Noah Misch n...@leadboat.com writes:
 On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
 So at this point I propose that we reject \u when de-escaping JSON.

 I would have agreed on 2014-12-09, and this release is the last chance to make
 such a change.  It is a bold wager that could pay off, but -1 from me anyway.

You only get to vote -1 if you have a credible alternative.  I don't see
one.

 I can already envision the blog post from the DBA staying on 9.4.0 because
 9.4.1 pulled his ability to store U+ in jsonb.

Those will be more or less the same people who bitch about text not
storing NULs; the world has not fallen.

 jsonb was *the* top-billed
 9.4 feature, and this thread started with Andrew conveying a field report of a
 scenario more obscure than storing U+.

There is a separate issue, which is that our earlier attempt to make the
world safe for \u actually broke other things.  We still need to fix
that, but I think the fix probably consists of reverting that patch and
instead disallowing \u.

 Anybody who's seriously unhappy with that can propose a patch to fix it
 properly in 9.5 or later.

 Someone can still do that by introducing a V2 of the jsonb binary format and
 preserving the ability to read both formats.  (Too bad Andres's proposal to
 include a format version didn't inform the final format, but we can wing it.)
 I agree that storing U+ as 0x00 is the best end state.

We will not need a v2 format, merely values that contain NULs.  Existing
data containing \u will be read as those six ASCII characters, which
is not really wrong since that might well be what it was anyway.

 We probably need to rethink the re-escaping behavior as well; I'm not
 sure if your latest patch is the right answer for that.

 Yes, we do.  No change to the representation of U+ is going to fix the
 following bug, but that patch does fix it:

 [local] test=# select
 test-# $$\\u05e2$$::jsonb = $$\\u05e2$$::jsonb,
 test-# $$\\u05e2$$::jsonb = $$\\u05e2$$::jsonb::text::jsonb;

The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
which I'm inclined to think we need to simply revert, not render even
more squirrely.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-28 Thread Andrew Dunstan



On 01/28/2015 12:50 AM, Noah Misch wrote:

On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 01/27/2015 02:28 PM, Tom Lane wrote:

Well, we can either fix it now or suffer with a broken representation
forever.  I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.

So at this point I propose that we reject \u when de-escaping JSON.

I would have agreed on 2014-12-09, and this release is the last chance to make
such a change.  It is a bold wager that could pay off, but -1 from me anyway.
I can already envision the blog post from the DBA staying on 9.4.0 because
9.4.1 pulled his ability to store U+ in jsonb.  jsonb was *the* top-billed
9.4 feature, and this thread started with Andrew conveying a field report of a
scenario more obscure than storing U+.  Therefore, we have to assume many
users will notice the change.  This move would also add to the growing
evidence that our .0 releases are really beta(N+1) releases in disguise.


Anybody who's seriously unhappy with that can propose a patch to fix it
properly in 9.5 or later.

Someone can still do that by introducing a V2 of the jsonb binary format and
preserving the ability to read both formats.  (Too bad Andres's proposal to
include a format version didn't inform the final format, but we can wing it.)
I agree that storing U+ as 0x00 is the best end state.





We need to make up our minds about this pretty quickly. The more radical 
move is likely to involve quite a bit of work, ISTM.


It's not clear to me how we should represent a unicode null. i.e. given 
a json of '[foo\ubar]', I get that we'd store the element as 
'foo\x00bar', but what is the result of


   (jsonb '[foo\ubar')-0

It's defined to be text so we can't just shove a binary null in the 
middle of it. Do we throw an error?


And I still want to hear more voices on the whole direction we want to 
take this.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

Andrew Dunstan and...@dunslane.net writes:
 It's not clear to me how we should represent a unicode null. i.e. given 
 a json of '[foo\ubar]', I get that we'd store the element as 
 'foo\x00bar', but what is the result of

 (jsonb '[foo\ubar')-0

 It's defined to be text so we can't just shove a binary null in the 
 middle of it. Do we throw an error?

Yes, that is what I was proposing upthread.  Obviously, this needs some
thought to ensure that there's *something* useful you can do with a field
containing a nul, but we'd have little choice but to throw an error if
the user asks us to convert such a field to unescaped text.

I'd be a bit inclined to reject nuls in object field names even if we
allow them in field values, since just about everything you can usefully
do with a field name involves regarding it as text.

Another interesting implementation problem is what does indexing do with
such values --- ISTR there's an implicit conversion to C strings in there
too, at least in GIN indexes.

Anyway, there is a significant amount of work involved here, and there's
no way we're getting it done for 9.4.1, or probably 9.4.anything.  I think
our only realistic choice right now is to throw error for \u so that
we can preserve our options for doing something useful with it later.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

Jim Nasby jim.na...@bluetreble.com writes:
 While I sympathize with Noah's sentiments, the only thing that makes sense to 
 me is that a JSON text field is treated the same way as we treat text. Right 
 now, that means NUL is not allowed, period.

 If no one has bitched about this with text, is it really that big a problem 
 with JSON?

Oh, people have bitched about it all right ;-).  But I think your real
question is how many applications find that restriction to be fatal.
The answer clearly is not a lot for text, and I don't see why the
answer would be different for JSON.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-28 Thread Jim Nasby


On 1/28/15 11:36 AM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

It's not clear to me how we should represent a unicode null. i.e. given
a json of '[foo\ubar]', I get that we'd store the element as
'foo\x00bar', but what is the result of



 (jsonb '[foo\ubar')-0



It's defined to be text so we can't just shove a binary null in the
middle of it. Do we throw an error?


Yes, that is what I was proposing upthread.  Obviously, this needs some
thought to ensure that there's *something* useful you can do with a field
containing a nul, but we'd have little choice but to throw an error if
the user asks us to convert such a field to unescaped text.

I'd be a bit inclined to reject nuls in object field names even if we
allow them in field values, since just about everything you can usefully
do with a field name involves regarding it as text.

Another interesting implementation problem is what does indexing do with
such values --- ISTR there's an implicit conversion to C strings in there
too, at least in GIN indexes.

Anyway, there is a significant amount of work involved here, and there's
no way we're getting it done for 9.4.1, or probably 9.4.anything.  I think
our only realistic choice right now is to throw error for \u so that
we can preserve our options for doing something useful with it later.


My $0.01:

While I sympathize with Noah's sentiments, the only thing that makes sense to 
me is that a JSON text field is treated the same way as we treat text. Right 
now, that means NUL is not allowed, period.

If no one has bitched about this with text, is it really that big a problem 
with JSON?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-28 Thread Jeff Janes

On Wed, Jan 28, 2015 at 1:13 PM, Jim Nasby jim.na...@bluetreble.com wrote:


 My $0.01:

 While I sympathize with Noah's sentiments, the only thing that makes sense
 to me is that a JSON text field is treated the same way as we treat text.
 Right now, that means NUL is not allowed, period.

 If no one has bitched about this with text, is it really that big a
 problem with JSON?


I would bitch about it for text if I thought that it would do any good.

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

David G Johnston david.g.johns...@gmail.com writes:
 Am I missing something or has there been no consideration in this forbid
 plan on whether users will be able to retrieve, even if partially
 incorrectly, any jsonb data that has already been stored?

Data that's already been stored would look like the six characters \u,
which indeed might have been what it was anyway, since part of the
problem here is that we fail to distinguish \\u from \u.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-28 Thread David G Johnston

Tom Lane-2 wrote
Andrew Dunstan lt;

andrew@

gt; writes:
On 01/27/2015 02:28 PM, Tom Lane wrote:
Well, we can either fix it now or suffer with a broken representation
forever. I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.

The only other plausible answer seems to be to flat out reject \u.
But I assume nobody likes that.

I don't think we can be in the business of rejecting valid JSON.

Actually, after studying the code a bit, I wonder if we wouldn't be best
off to do exactly that, at least for 9.4.x. At minimum we're talking
about an API change for JsonSemAction functions (which currently get the
already-de-escaped string as a C string; not gonna work for embedded
nulls). I'm not sure if there are already third-party extensions using
that API, but it seems possible, in which case changing it in a minor
release wouldn't be nice. Even ignoring that risk, making sure
we'd fixed everything seems like more than a day's work, which is as
much as I for one could spare before 9.4.1.

So at this point I propose that we reject \u when de-escaping JSON.
Anybody who's seriously unhappy with that can propose a patch to fix it
properly in 9.5 or later.

The hybrid option is to reject the values for 9.4.1 but then commit to
removing that hack and fixing this properly in 9.4.2; we can always call
that release 9.5...

I think the it would mean rejecting valid JSON argument is utter
hogwash. We already reject, eg, \u00A0 if you're not using a UTF8
encoding. And we reject 1e1, not because that's invalid JSON
but because of an implementation restriction of our underlying numeric
type. I don't see any moral superiority of that over rejecting \u
because of an implementation restriction of our underlying text type.

Am I missing something or has there been no consideration in this forbid
plan on whether users will be able to retrieve, even if partially
incorrectly, any jsonb data that has already been stored? If we mangled
their data on input we should at least return the data and provide them a
chance to manually (or automatically depending on their data) fix our
mistake.

Given we already disallow NUL in text ISTM that allowing said data in other
text-like areas is asking for just the kind of trouble we are seeing here.
I'm OK with the proposition that those wishing to utilize NUL are relegated
to working with bytea.

From the commit Tom references down-thread:

However, this led to some perverse results in the case of Unicode
sequences.

Given that said results are not documented in the commit its hard to judge
whether a complete revert is being overly broad...

Anyway, just some observations since I'm not currently a user of JSON.
Tom's arguments and counter-arguments ring true to me in the general sense.
The DBA staying on 9.4.0 because of this change probably just needs to be
told to go back to using json and then run the update. Their data has
issues even they stay on 9.4.0 with the more accepting version of jsonb.

David J.

--
View this message in context:
http://postgresql.nabble.com/jsonb-unicode-escapes-and-escaped-backslashes-tp5834962p5835824.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 01:40 PM, Tom Lane wrote:
 In particular, I would like to suggest that the current representation of
 \u is fundamentally broken and that we have to change it, not try to
 band-aid around it.  This will mean an on-disk incompatibility for jsonb
 data containing U+, but hopefully there is very little of that out
 there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
 consider such solutions.

 Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on 
 the table what I suggested is unnecessary.

Well, we can either fix it now or suffer with a broken representation
forever.  I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.

The only other plausible answer seems to be to flat out reject \u.
But I assume nobody likes that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes



On 01/27/2015 02:28 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 01/27/2015 01:40 PM, Tom Lane wrote:

In particular, I would like to suggest that the current representation of
\u is fundamentally broken and that we have to change it, not try to
band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+, but hopefully there is very little of that out
there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.

Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on
the table what I suggested is unnecessary.

Well, we can either fix it now or suffer with a broken representation
forever.  I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.

The only other plausible answer seems to be to flat out reject \u.
But I assume nobody likes that.




I don't think we can be in the business of rejecting valid JSON.

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 02:28 PM, Tom Lane wrote:
 Well, we can either fix it now or suffer with a broken representation
 forever.  I'm not wedded to the exact solution I described, but I think
 we'll regret it if we don't change the representation.
 
 The only other plausible answer seems to be to flat out reject \u.
 But I assume nobody likes that.

 I don't think we can be in the business of rejecting valid JSON.

Actually, after studying the code a bit, I wonder if we wouldn't be best
off to do exactly that, at least for 9.4.x.  At minimum we're talking
about an API change for JsonSemAction functions (which currently get the
already-de-escaped string as a C string; not gonna work for embedded
nulls).  I'm not sure if there are already third-party extensions using
that API, but it seems possible, in which case changing it in a minor
release wouldn't be nice.  Even ignoring that risk, making sure
we'd fixed everything seems like more than a day's work, which is as
much as I for one could spare before 9.4.1.

Also, while the idea of throwing error only when a \0 needs to be
converted to text seems logically clean, it looks like that might pretty
much cripple the usability of such values anyhow, because we convert to
text at the drop of a hat.  So some investigation and probably additional
work would be needed to ensure you could do at least basic things with
such values.  (A function for direct conversion to bytea might be useful
too.)

I think the it would mean rejecting valid JSON argument is utter
hogwash.  We already reject, eg, \u00A0 if you're not using a UTF8
encoding.  And we reject 1e1, not because that's invalid JSON
but because of an implementation restriction of our underlying numeric
type.  I don't see any moral superiority of that over rejecting \u
because of an implementation restriction of our underlying text type.

So at this point I propose that we reject \u when de-escaping JSON.
Anybody who's seriously unhappy with that can propose a patch to fix it
properly in 9.5 or later.

We probably need to rethink the re-escaping behavior as well; I'm not
sure if your latest patch is the right answer for that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

Merlin Moncure mmonc...@gmail.com writes:
 On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 In particular, I would like to suggest that the current representation of
 \u is fundamentally broken and that we have to change it, not try to
 band-aid around it.  This will mean an on-disk incompatibility for jsonb
 data containing U+, but hopefully there is very little of that out
 there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
 consider such solutions.
 
 The most obvious way to store such data unambiguously is to just go ahead
 and store U+ as a NUL byte (\000).  The only problem with that is that
 then such a string cannot be considered to be a valid value of type TEXT,
 which would mean that we'd need to throw an error if we were asked to
 convert a JSON field containing such a character to text.

 Hm, does this include text out operations for display purposes?   If
 so, then any query selecting jsonb objects with null bytes would fail.
 How come we have to error out?  How about a warning indicating the
 string was truncated?

No, that's not a problem, because jsonb_out would produce an escaped
textual representation, so any null would come out as \u again.
The trouble comes up when you do something that's supposed to produce
a *non escaped* text equivalent of a JSON string value, such as
the - operator.

Arguably, - is broken already with the current coding, in that
these results are entirely inconsistent:

regression=# select '{a:foo\u0040bar}'::jsonb - 'a';
 ?column? 
--
 foo@bar
(1 row)

regression=# select '{a:foo\ubar}'::jsonb - 'a';
   ?column?   
--
 foo\ubar
(1 row)

regression=# select '{a:foo\\ubar}'::jsonb - 'a';
   ?column?   
--
 foo\ubar
(1 row)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 12:23 PM, Tom Lane wrote:
 I think coding anything is premature until we decide how we're going to
 deal with the fundamental ambiguity.

 The input \\uabcd will be stored correctly as \uabcd, but this will in 
 turn be rendered as \uabcd, whereas it should be rendered as \\uabcd. 
 That's what the patch fixes.
 There are two problems here and this addresses one of them. The other 
 problem is the ambiguity regarding \\u and \u.

It's the same problem really, and until we have an answer about
what to do with \u, I think any patch is half-baked and possibly
counterproductive.

In particular, I would like to suggest that the current representation of
\u is fundamentally broken and that we have to change it, not try to
band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+, but hopefully there is very little of that out
there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.

The most obvious way to store such data unambiguously is to just go ahead
and store U+ as a NUL byte (\000).  The only problem with that is that
then such a string cannot be considered to be a valid value of type TEXT,
which would mean that we'd need to throw an error if we were asked to
convert a JSON field containing such a character to text.  I don't
particularly have a problem with that, except possibly for the time cost
of checking for \000 before allowing a conversion to occur.  While a
memchr() check might be cheap enough, we could also consider inventing a
new JEntry type code for string-containing-null, so that there's a
distinction in the type system between strings that are coercible to text
and those that are not.

If we went down a path like that, the currently proposed patch would be
quite useless.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes



On 01/27/2015 01:40 PM, Tom Lane wrote:


In particular, I would like to suggest that the current representation of
\u is fundamentally broken and that we have to change it, not try to
band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+, but hopefully there is very little of that out
there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.




Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on 
the table what I suggested is unnecessary.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Merlin Moncure

On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 12:23 PM, Tom Lane wrote:
 I think coding anything is premature until we decide how we're going to
 deal with the fundamental ambiguity.

 The input \\uabcd will be stored correctly as \uabcd, but this will in
 turn be rendered as \uabcd, whereas it should be rendered as \\uabcd.
 That's what the patch fixes.
 There are two problems here and this addresses one of them. The other
 problem is the ambiguity regarding \\u and \u.

 It's the same problem really, and until we have an answer about
 what to do with \u, I think any patch is half-baked and possibly
 counterproductive.

 In particular, I would like to suggest that the current representation of
 \u is fundamentally broken and that we have to change it, not try to
 band-aid around it.  This will mean an on-disk incompatibility for jsonb
 data containing U+, but hopefully there is very little of that out
 there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
 consider such solutions.

 The most obvious way to store such data unambiguously is to just go ahead
 and store U+ as a NUL byte (\000).  The only problem with that is that
 then such a string cannot be considered to be a valid value of type TEXT,
 which would mean that we'd need to throw an error if we were asked to
 convert a JSON field containing such a character to text.

Hm, does this include text out operations for display purposes?   If
so, then any query selecting jsonb objects with null bytes would fail.
How come we have to error out?  How about a warning indicating the
string was truncated?

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

2015-01-27 Thread Noah Misch

On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
 Andrew Dunstan and...@dunslane.net writes:
  On 01/27/2015 02:28 PM, Tom Lane wrote:
  Well, we can either fix it now or suffer with a broken representation
  forever.  I'm not wedded to the exact solution I described, but I think
  we'll regret it if we don't change the representation.

 So at this point I propose that we reject \u when de-escaping JSON.

I would have agreed on 2014-12-09, and this release is the last chance to make
such a change.  It is a bold wager that could pay off, but -1 from me anyway.
I can already envision the blog post from the DBA staying on 9.4.0 because
9.4.1 pulled his ability to store U+ in jsonb.  jsonb was *the* top-billed
9.4 feature, and this thread started with Andrew conveying a field report of a
scenario more obscure than storing U+.  Therefore, we have to assume many
users will notice the change.  This move would also add to the growing
evidence that our .0 releases are really beta(N+1) releases in disguise.

 Anybody who's seriously unhappy with that can propose a patch to fix it
 properly in 9.5 or later.

Someone can still do that by introducing a V2 of the jsonb binary format and
preserving the ability to read both formats.  (Too bad Andres's proposal to
include a format version didn't inform the final format, but we can wing it.)
I agree that storing U+ as 0x00 is the best end state.

 We probably need to rethink the re-escaping behavior as well; I'm not
 sure if your latest patch is the right answer for that.

Yes, we do.  No change to the representation of U+ is going to fix the
following bug, but that patch does fix it:

[local] test=# select
test-# $$\\u05e2$$::jsonb = $$\\u05e2$$::jsonb,
test-# $$\\u05e2$$::jsonb = $$\\u05e2$$::jsonb::text::jsonb;
 ?column? | ?column? 
--+--
 t| f
(1 row)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes



On 01/27/2015 12:24 AM, Noah Misch wrote:

For the moment, maybe I could commit the fix for the non U+ case for
escape_json, and we could think some more about detecting and warning about
the problem strings.

+1 for splitting development that way.  Fixing the use of escape_json() is
objective, but the choices around the warning are more subtle.




OK, so something like this patch? I'm mildly concerned that you and I 
are the only ones commenting on this.


cheers

andrew


diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 3c137ea..8d2b38f 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -94,6 +94,8 @@ static void datum_to_json(Datum val, bool is_null, StringInfo result,
 static void add_json(Datum val, bool is_null, StringInfo result,
 		 Oid val_type, bool key_scalar);
 static text *catenate_stringinfo_string(StringInfo buffer, const char *addon);
+static void  escape_json_work(StringInfo buf, const char *str,
+			  bool jsonb_unicode);
 
 /* the null action object used for pure validation */
 static JsonSemAction nullSemAction =
@@ -2356,6 +2358,18 @@ json_object_two_arg(PG_FUNCTION_ARGS)
 void
 escape_json(StringInfo buf, const char *str)
 {
+	escape_json_work( buf, str, false);
+}
+
+void
+escape_jsonb(StringInfo buf, const char *str)
+{
+	escape_json_work( buf, str, true);
+}
+
+static void
+escape_json_work(StringInfo buf, const char *str, bool jsonb_unicode)
+{
 	const char *p;
 
 	appendStringInfoCharMacro(buf, '\');
@@ -2398,7 +2412,9 @@ escape_json(StringInfo buf, const char *str)
  * unicode escape that should be present is \u, all the
  * other unicode escapes will have been resolved.
  */
-if (p[1] == 'u' 
+if (jsonb_unicode  strncmp(p+1, u, 5) == 0)
+	appendStringInfoCharMacro(buf, *p);
+else if (!jsonb_unicode  p[1] == 'u' 
 	isxdigit((unsigned char) p[2]) 
 	isxdigit((unsigned char) p[3]) 
 	isxdigit((unsigned char) p[4]) 
diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index 644ea6d..22e1263 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -305,7 +305,7 @@ jsonb_put_escaped_value(StringInfo out, JsonbValue *scalarVal)
 			appendBinaryStringInfo(out, null, 4);
 			break;
 		case jbvString:
-			escape_json(out, pnstrdup(scalarVal-val.string.val, scalarVal-val.string.len));
+			escape_jsonb(out, pnstrdup(scalarVal-val.string.val, scalarVal-val.string.len));
 			break;
 		case jbvNumeric:
 			appendStringInfoString(out,
diff --git a/src/include/utils/json.h b/src/include/utils/json.h
index 6f69403..661d7bd 100644
--- a/src/include/utils/json.h
+++ b/src/include/utils/json.h
@@ -42,7 +42,13 @@ extern Datum json_build_array_noargs(PG_FUNCTION_ARGS);
 extern Datum json_object(PG_FUNCTION_ARGS);
 extern Datum json_object_two_arg(PG_FUNCTION_ARGS);
 
+/*
+ * escape_jsonb is more strict about unicode escapes, and will only not
+ * escape \u, as that is the only unicode escape that should still be
+ * present.
+ */
 extern void escape_json(StringInfo buf, const char *str);
+extern void escape_jsonb(StringInfo buf, const char *str);
 
 extern Datum json_typeof(PG_FUNCTION_ARGS);
 
diff --git a/src/test/regress/expected/jsonb.out b/src/test/regress/expected/jsonb.out
index aa5686f..b5457c4 100644
--- a/src/test/regress/expected/jsonb.out
+++ b/src/test/regress/expected/jsonb.out
@@ -324,11 +324,20 @@ select to_jsonb(timestamptz '2014-05-28 12:22:35.614298-04');
 (1 row)
 
 COMMIT;
--- unicode escape - backslash is not escaped
+-- unicode null - backslash not escaped
+-- note: using to_jsonb here bypasses the jsonb input routine.
+select jsonb '\u', to_jsonb(text '\u');
+  jsonb   | to_jsonb 
+--+--
+ \u | \u
+(1 row)
+
+-- any other unicode-looking escape - backslash is escaped
+-- all unicode characters should have been resolved
 select to_jsonb(text '\uabcd');
- to_jsonb 
---
- \uabcd
+ to_jsonb  
+---
+ \\uabcd
 (1 row)
 
 -- any other backslash is escaped
@@ -338,6 +347,14 @@ select to_jsonb(text '\abcd');
  \\abcd
 (1 row)
 
+-- doubled backslash should be reproduced
+-- this isn't a unicode escape, it's an escaped backslash followed by 'u'
+select jsonb '\\uabcd';
+   jsonb   
+---
+ \\uabcd
+(1 row)
+
 --jsonb_agg
 CREATE TEMP TABLE rows AS
 SELECT x, 'txt' || x as y
diff --git a/src/test/regress/expected/jsonb_1.out b/src/test/regress/expected/jsonb_1.out
index 687ae63..10aaac8 100644
--- a/src/test/regress/expected/jsonb_1.out
+++ b/src/test/regress/expected/jsonb_1.out
@@ -324,11 +324,20 @@ select to_jsonb(timestamptz '2014-05-28 12:22:35.614298-04');
 (1 row)
 
 COMMIT;
--- unicode escape - backslash is not escaped
+-- unicode null - backslash not escaped
+-- note: using to_jsonb here bypasses the jsonb input routine.
+select jsonb '\u', to_jsonb(text '\u');
+  jsonb   | to_jsonb 
+--+--
+ \u | \u
+(1 row)
+
+-- any

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes

Andrew Dunstan and...@dunslane.net writes:
 On 01/27/2015 12:24 AM, Noah Misch wrote:
 +1 for splitting development that way.  Fixing the use of escape_json() is
 objective, but the choices around the warning are more subtle.

 OK, so something like this patch? I'm mildly concerned that you and I 
 are the only ones commenting on this.

Doesn't seem to me like this fixes anything.  If the content of a jsonb
value is correct, the output will be the same with or without this patch;
and if it's not, this isn't going to really improve matters.

I think coding anything is premature until we decide how we're going to
deal with the fundamental ambiguity.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] jsonb, unicode escapes and escaped backslashes