Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On Sat, Jan 31, 2015 at 8:25 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> I understand Andrew to be saying that if you take a 6-character string
>> and convert it to a JSON string and then back to text, you will
>> *usually* get back the same 6 characters you started with ... unless
>> the first character was \, the second u, and the remainder hexadecimal
>> digits.  Then you'll get back a one-character string or an error
>> instead.  It's not hard to imagine that leading to surprising
>> behavior, or even security vulnerabilities in applications that
>> aren't expecting such a translation to happen under them.
>
> That *was* the case, with the now-reverted patch that changed the
> escaping rules.  It's not anymore:
>
> regression=# select to_json('\u1234'::text);
>   to_json
> -----------
>  "\\u1234"
> (1 row)
>
> When you convert that back to text, you'll get \u1234, no more and no
> less.  For example:
>
> regression=# select array_to_json(array['\u1234'::text]);
>  array_to_json
> ---------------
>  ["\\u1234"]
> (1 row)
>
> regression=# select array_to_json(array['\u1234'::text])->0;
>   ?column?
> -----------
>  "\\u1234"
> (1 row)
>
> regression=# select array_to_json(array['\u1234'::text])->>0;
>  ?column?
> ----------
>  \u1234
> (1 row)
>
> Now, if you put in '\u1234'::jsonb and extract that string as text,
> you get some Unicode character or other.  But I'd say that a JSON user
> who is surprised by that doesn't understand JSON, and definitely that
> they hadn't read more than about one paragraph of our description of
> the JSON types.

Totally agree.  That's why I think reverting the patch was the right
thing to do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Robert Haas <robertmh...@gmail.com> writes:
> I understand Andrew to be saying that if you take a 6-character string
> and convert it to a JSON string and then back to text, you will
> *usually* get back the same 6 characters you started with ... unless
> the first character was \, the second u, and the remainder hexadecimal
> digits.  Then you'll get back a one-character string or an error
> instead.  It's not hard to imagine that leading to surprising
> behavior, or even security vulnerabilities in applications that aren't
> expecting such a translation to happen under them.

That *was* the case, with the now-reverted patch that changed the
escaping rules.  It's not anymore:

regression=# select to_json('\u1234'::text);
  to_json
-----------
 "\\u1234"
(1 row)

When you convert that back to text, you'll get \u1234, no more and no
less.  For example:

regression=# select array_to_json(array['\u1234'::text]);
 array_to_json
---------------
 ["\\u1234"]
(1 row)

regression=# select array_to_json(array['\u1234'::text])->0;
  ?column?
-----------
 "\\u1234"
(1 row)

regression=# select array_to_json(array['\u1234'::text])->>0;
 ?column?
----------
 \u1234
(1 row)

Now, if you put in '\u1234'::jsonb and extract that string as text, you
get some Unicode character or other.  But I'd say that a JSON user who
is surprised by that doesn't understand JSON, and definitely that they
hadn't read more than about one paragraph of our description of the
JSON types.

			regards, tom lane
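[Editorial note: Tom's round trip can be reproduced with any conforming JSON implementation; the following is a minimal Python sketch using the stdlib json module, not anything Postgres-specific, showing that a text value that merely looks like a Unicode escape survives an encode/decode round trip intact.]

```python
import json

# A six-character text value that merely looks like a Unicode escape:
# backslash, 'u', '1', '2', '3', '4'.
s = r"\u1234"

encoded = json.dumps(s)   # the backslash is escaped on output
decoded = json.loads(encoded)

print(encoded)        # "\\u1234"  (the same form to_json produces)
print(decoded == s)   # True: all six characters survive the round trip
```

This mirrors the current (post-revert) to_json behavior: the backslash is always escaped when forming a JSON string, so decoding returns exactly the original characters.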
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On Thu, Jan 29, 2015 at 11:28 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> The point of JSONB is that we take a position on certain aspects like
>> this.  We're bridging a pointedly loosey goosey interchange format,
>> JSON, with native PostgreSQL types.  For example, we take a firm
>> position on encoding.  The JSON type is a bit more permissive, to
>> about the extent that that's possible.  The whole point is that we're
>> interpreting JSON data in a way that's consistent with *Postgres*
>> conventions.  You'd have to interpret the data according to *some*
>> convention in order to do something non-trivial with it in any case,
>> and users usually want that.
>
> I quite agree with you, actually, in terms of that perspective.

Sure, but I wasn't sure that that was evident to others.

To emphasize: I think it's appropriate that the JSON spec takes
somewhat of a back seat approach to things like encoding and the
precision of numbers.  I also think it's appropriate that JSONB does
not, up to and including where JSONB forbids things that the JSON spec
supposes could be useful.  We haven't failed users by (say) not
accepting NULs, even though the spec suggests that that might be
useful - we have provided them with a reasonable, concrete
interpretation of that JSON data, with lots of useful operators, that
they may take or leave.  It really isn't historical that we have both a
JSON and JSONB type.  For other examples of this, see every document
database in existence.  Depart from this perspective, as an interchange
standard author, and you end up with something like XML, which while
easy to reason about isn't all that useful, or BSON, the binary
interchange format, which is an oxymoron.

> But my point remains: \u0000 is not invalid JSON syntax, and neither
> is \u1234.  If we choose to throw an error because we can't interpret
> or process that according to our conventions, fine, but we should call
> it something other than invalid syntax.
> ERRCODE_UNTRANSLATABLE_CHARACTER or
> ERRCODE_CHARACTER_NOT_IN_REPERTOIRE seem more apropos from here.

I see.  I'd go with ERRCODE_UNTRANSLATABLE_CHARACTER, then.

-- 
Peter Geoghegan
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> The cause of this bug is commit
> 0ad1a816320a2b539a51628e2a0b1e83ff096b1d, which I'm inclined to think
> we need to simply revert, not render even more squirrely.

Yes, that commit looks broken.  If you convert from text to JSON you
should get a JSON string equivalent to the text you started out with.
That commit departs from that in favor of allowing \u sequences in the
text being converted to turn into a single character (or perhaps an
encoding error) after a text->json->text round trip.  Maybe I haven't
had enough caffeine today, but I can't see how that can possibly be a
good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On Wed, Jan 28, 2015 at 12:48:45PM -0500, Tom Lane wrote:
> Noah Misch <n...@leadboat.com> writes:
>> On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
>>> So at this point I propose that we reject \u0000 when de-escaping
>>> JSON.
>> I would have agreed on 2014-12-09, and this release is the last
>> chance to make such a change.  It is a bold wager that could pay off,
>> but -1 from me anyway.
> You only get to vote -1 if you have a credible alternative.  I don't
> see one.

I don't love json enough to keep participating in a thread where you
dismiss patches and comments from Andrew and myself as "quite useless",
"utter hogwash" and non-credible.

nm
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On Wed, Jan 28, 2015 at 03:13:47PM -0600, Jim Nasby wrote:
> While I sympathize with Noah's sentiments, the only thing that makes
> sense to me is that a JSON text field is treated the same way as we
> treat text.  Right now, that means NUL is not allowed, period.

+1

-- 
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On 01/29/2015 04:59 PM, Robert Haas wrote:
> On Thu, Jan 29, 2015 at 4:33 PM, Andrew Dunstan <and...@dunslane.net> wrote:
>> On 01/29/2015 12:10 PM, Robert Haas wrote:
>>> On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>>> The cause of this bug is commit
>>>> 0ad1a816320a2b539a51628e2a0b1e83ff096b1d, which I'm inclined to
>>>> think we need to simply revert, not render even more squirrely.
>>> Yes, that commit looks broken.  If you convert from text to JSON you
>>> should get a JSON string equivalent to the text you started out
>>> with.  That commit departs from that in favor of allowing \u
>>> sequences in the text being converted to turn into a single
>>> character (or perhaps an encoding error) after a text->json->text
>>> round trip.  Maybe I haven't had enough caffeine today, but I can't
>>> see how that can possibly be a good idea.
>> Possibly.  I'm coming down more and more on the side of Tom's
>> suggestion just to ban \u0000 in jsonb.  I think that would let us
>> have some fairly simple and consistent rules.  I'm not too worried
>> that we'll be disallowing input that we've previously allowed.  We've
>> done that often in the past, although less often in point releases.
>> I certainly don't want to wait a full release cycle to fix this if
>> possible.
> I have yet to understand what we fix by banning \u0000.  How is it
> different from any other four-digit hexadecimal number that's not a
> valid character in the current encoding?  What does banning that one
> particular value do?
>
> In any case, whatever we do about that issue, the idea that the
> text -> json string transformation can *change the input string into
> some other string* seems like an independent problem.

jsonb stores string values as postgres text values, with the unicode
escapes resolved, just as it also resolves numbers and booleans into
their native representation.  If you want the input perfectly
preserved, use json, not jsonb.  I think that's made pretty clear in
the docs.  So text->jsonb->text is not and has never been expected to
be a noop.

cheers

andrew
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On Thu, Jan 29, 2015 at 4:33 PM, Andrew Dunstan <and...@dunslane.net> wrote:
> On 01/29/2015 12:10 PM, Robert Haas wrote:
>> On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> The cause of this bug is commit
>>> 0ad1a816320a2b539a51628e2a0b1e83ff096b1d, which I'm inclined to
>>> think we need to simply revert, not render even more squirrely.
>> Yes, that commit looks broken.  If you convert from text to JSON you
>> should get a JSON string equivalent to the text you started out with.
>> That commit departs from that in favor of allowing \u sequences in
>> the text being converted to turn into a single character (or perhaps
>> an encoding error) after a text->json->text round trip.  Maybe I
>> haven't had enough caffeine today, but I can't see how that can
>> possibly be a good idea.
> Possibly.  I'm coming down more and more on the side of Tom's
> suggestion just to ban \u0000 in jsonb.  I think that would let us
> have some fairly simple and consistent rules.  I'm not too worried
> that we'll be disallowing input that we've previously allowed.  We've
> done that often in the past, although less often in point releases.  I
> certainly don't want to wait a full release cycle to fix this if
> possible.

I have yet to understand what we fix by banning \u0000.  How is it
different from any other four-digit hexadecimal number that's not a
valid character in the current encoding?  What does banning that one
particular value do?

In any case, whatever we do about that issue, the idea that the
text -> json string transformation can *change the input string into
some other string* seems like an independent problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On 01/29/2015 12:10 PM, Robert Haas wrote:
> On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> The cause of this bug is commit
>> 0ad1a816320a2b539a51628e2a0b1e83ff096b1d, which I'm inclined to think
>> we need to simply revert, not render even more squirrely.
> Yes, that commit looks broken.  If you convert from text to JSON you
> should get a JSON string equivalent to the text you started out with.
> That commit departs from that in favor of allowing \u sequences in the
> text being converted to turn into a single character (or perhaps an
> encoding error) after a text->json->text round trip.  Maybe I haven't
> had enough caffeine today, but I can't see how that can possibly be a
> good idea.

Possibly.  I'm coming down more and more on the side of Tom's
suggestion just to ban \u0000 in jsonb.  I think that would let us have
some fairly simple and consistent rules.  I'm not too worried that
we'll be disallowing input that we've previously allowed.  We've done
that often in the past, although less often in point releases.  I
certainly don't want to wait a full release cycle to fix this if
possible.

cheers

andrew
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On 2015-01-29 16:33:36 -0500, Andrew Dunstan wrote:
> On 01/29/2015 12:10 PM, Robert Haas wrote:
>> On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> The cause of this bug is commit
>>> 0ad1a816320a2b539a51628e2a0b1e83ff096b1d, which I'm inclined to
>>> think we need to simply revert, not render even more squirrely.
>> Yes, that commit looks broken.  If you convert from text to JSON you
>> should get a JSON string equivalent to the text you started out with.
>> That commit departs from that in favor of allowing \u sequences in
>> the text being converted to turn into a single character (or perhaps
>> an encoding error) after a text->json->text round trip.  Maybe I
>> haven't had enough caffeine today, but I can't see how that can
>> possibly be a good idea.
> I'm coming down more and more on the side of Tom's suggestion just to
> ban \u0000 in jsonb.  I think that would let us have some fairly
> simple and consistent rules.  I'm not too worried that we'll be
> disallowing input that we've previously allowed.  We've done that
> often in the past, although less often in point releases.  I certainly
> don't want to wait a full release cycle to fix this if possible.

I think waiting a full cycle would be likely to make things much worse,
given 9.4's current age.

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Robert Haas <robertmh...@gmail.com> writes:
> On Thu, Jan 29, 2015 at 4:33 PM, Andrew Dunstan <and...@dunslane.net> wrote:
>> I'm coming down more and more on the side of Tom's suggestion just to
>> ban \u0000 in jsonb.
> I have yet to understand what we fix by banning \u0000.  How is it
> different from any other four-digit hexadecimal number that's not a
> valid character in the current encoding?  What does banning that one
> particular value do?

As Andrew pointed out upthread, it avoids having to answer the question
of what to return for

	select (jsonb '["foo\u0000bar"]')->>0;

or any other construct which is supposed to return an *unescaped* text
representation of some JSON string value.  Right now you get

   ?column?
--------------
 foo\u0000bar
(1 row)

Which is wrong IMO, first because it violates the premise that the
output should be unescaped, and second because this output cannot be
distinguished from the (correct) output of

regression=# select (jsonb '["foo\\u0000bar"]')->>0;
   ?column?
--------------
 foo\u0000bar
(1 row)

There is no way to deliver an output that is not confusable with some
other value's correct output, other than by emitting a genuine \0 byte,
which unfortunately we cannot support in a TEXT result.  Potential
solutions for this have been mooted upthread, but none of them look
like they're something we can do in the very short run.  So the
proposal is to ban \u0000 until such time as we can do something sane
with it.

> In any case, whatever we do about that issue, the idea that the
> text -> json string transformation can *change the input string into
> some other string* seems like an independent problem.

No, it's exactly the same problem, because the reason for that breakage
is an ill-advised attempt to make it safe to include \u0000 in JSONB.

			regards, tom lane
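[Editorial note: Tom's collision argument can be checked with any JSON decoder; a minimal Python sketch (stdlib json, nothing Postgres-specific) of the two values he compares. The `.replace()` step is only an illustration of what a NUL-free text output would have to do.]

```python
import json

with_nul  = json.loads('"foo\\u0000bar"')    # a real NUL in the middle
with_text = json.loads('"foo\\\\u0000bar"')  # the six characters \u0000

print(with_nul == with_text)    # False: these are distinct JSON values

# A text output that cannot carry a NUL byte and instead renders it as
# the literal sequence \u0000 makes the two values indistinguishable:
rendered = with_nul.replace("\x00", "\\u0000")
print(rendered == with_text)    # True: the outputs collide
```

Two different JSON strings thus map to one and the same unescaped text output, which is exactly the confusability Tom describes.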
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Robert Haas <robertmh...@gmail.com> writes:
> I have yet to understand what we fix by banning \u0000.  How is it
> different from any other four-digit hexadecimal number that's not a
> valid character in the current encoding?  What does banning that one
> particular value do?

BTW, as to the point about encoding violations: we *already* ban \u
sequences that don't correspond to valid characters in the current
encoding.  The attempt to exclude U+0000 from the set of banned
characters was ill-advised, plain and simple.

			regards, tom lane
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Attached is a draft patch for this.  It basically reverts commit
0ad1a816320a2b539a51628e2a0b1e83ff096b1d, adds a ban of \u0000 if that
would need to be converted to text (so it still works in the plain json
type, so long as you don't do much processing), and adds some
regression tests.

I made the \u0000 error be errcode(ERRCODE_INVALID_TEXT_REPRESENTATION)
and errmsg("invalid input syntax for type json"), by analogy to what's
thrown for non-ASCII Unicode escapes in non-UTF8 encoding.  I'm not
terribly happy with that, though.  ISTM that for both cases, this is
not invalid syntax at all, but an implementation restriction that
forces us to reject perfectly valid syntax.  So I think we ought to use
a different ERRCODE and text message, though I'm not entirely sure what
it should be instead.  ERRCODE_FEATURE_NOT_SUPPORTED is one
possibility.

			regards, tom lane

diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml
index 8feb2fb..b4b97a7 100644
*** a/doc/src/sgml/json.sgml
--- b/doc/src/sgml/json.sgml
***************
*** 69,80 ****
  regardless of the database encoding, and are checked only for syntactic
  correctness (that is, that four hex digits follow <literal>\u</>).
  However, the input function for <type>jsonb</> is stricter: it disallows
! Unicode escapes for non-ASCII characters (those
! above <literal>U+007F</>) unless the database encoding is UTF8.  It also
! insists that any use of Unicode surrogate pairs to designate characters
! outside the Unicode Basic Multilingual Plane be correct.  Valid Unicode
! escapes, except for <literal>\u0000</>, are then converted to the
! equivalent ASCII or UTF8 character for storage.
  </para>

  <note>
--- 69,82 ----
  regardless of the database encoding, and are checked only for syntactic
  correctness (that is, that four hex digits follow <literal>\u</>).
  However, the input function for <type>jsonb</> is stricter: it disallows
! Unicode escapes for non-ASCII characters (those above <literal>U+007F</>)
! unless the database encoding is UTF8.  The <type>jsonb</> type also
! rejects <literal>\u0000</> (because that cannot be represented in
! <productname>PostgreSQL</productname>'s <type>text</> type), and it insists
! that any use of Unicode surrogate pairs to designate characters outside
! the Unicode Basic Multilingual Plane be correct.  Valid Unicode escapes
! are converted to the equivalent ASCII or UTF8 character for storage;
! this includes folding surrogate pairs into a single character.
  </para>

  <note>
***************
*** 134,140 ****
   <row>
    <entry><type>string</></entry>
    <entry><type>text</></entry>
!   <entry>See notes above concerning encoding restrictions</entry>
   </row>
   <row>
    <entry><type>number</></entry>
--- 136,143 ----
   <row>
    <entry><type>string</></entry>
    <entry><type>text</></entry>
!   <entry><literal>\u0000</> is disallowed, as are non-ASCII Unicode
!   escapes if database encoding is not UTF8</entry>
   </row>
   <row>
    <entry><type>number</></entry>
diff --git a/doc/src/sgml/release-9.4.sgml b/doc/src/sgml/release-9.4.sgml
index 961e461..11bbf3b 100644
*** a/doc/src/sgml/release-9.4.sgml
--- b/doc/src/sgml/release-9.4.sgml
***************
*** 103,124 ****
      <listitem>
       <para>
-       Unicode escapes in <link linkend="datatype-json"><type>JSON</type></link>
-       text values are no longer rendered with the backslash escaped
-       (Andrew Dunstan)
-      </para>
-
-      <para>
-       Previously, all backslashes in text values being formed into JSON
-       were escaped.  Now a backslash followed by <literal>u</> and four
-       hexadecimal digits is not escaped, as this is a legal sequence in a
-       JSON string value, and escaping the backslash led to some perverse
-       results.
-      </para>
-     </listitem>
-
-     <listitem>
-      <para>
        When converting values of type <type>date</>, <type>timestamp</>
        or <type>timestamptz</>
        to <link linkend="datatype-json"><type>JSON</type></link>, render the
--- 103,108 ----
diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 3c137ea..4e46b0a 100644
*** a/src/backend/utils/adt/json.c
--- b/src/backend/utils/adt/json.c
*************** json_lex_string(JsonLexContext *lex)
*** 806,819 ****
  				 * For UTF8, replace the escape sequence by the actual
  				 * utf8 character in lex->strval.  Do this also for other
  				 * encodings if the escape designates an ASCII character,
! 				 * otherwise raise an error.  We don't ever unescape a
! 				 * \u0000, since that would result in an impermissible nul
! 				 * byte.
  				 */
  				if (ch == 0)
  				{
! 					appendStringInfoString(lex->strval, "\\u0000");
  				}
  				else if (GetDatabaseEncoding() == PG_UTF8)
  				{
--- 806,822 ----
  				 * For UTF8, replace the escape sequence by the actual
  				 * utf8 character in lex->strval.  Do this also for other
  				 * encodings if the escape designates an ASCII character,
! 				 *
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On Thu, Jan 29, 2015 at 10:20 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I made the \u0000 error be
> errcode(ERRCODE_INVALID_TEXT_REPRESENTATION) and errmsg("invalid input
> syntax for type json"), by analogy to what's thrown for non-ASCII
> Unicode escapes in non-UTF8 encoding.  I'm not terribly happy with
> that, though.  ISTM that for both cases, this is not invalid syntax at
> all, but an implementation restriction that forces us to reject
> perfectly valid syntax.  So I think we ought to use a different
> ERRCODE and text message, though I'm not entirely sure what it should
> be instead.  ERRCODE_FEATURE_NOT_SUPPORTED is one possibility.

I personally prefer what you have here.

The point of JSONB is that we take a position on certain aspects like
this.  We're bridging a pointedly loosey goosey interchange format,
JSON, with native PostgreSQL types.  For example, we take a firm
position on encoding.  The JSON type is a bit more permissive, to about
the extent that that's possible.  The whole point is that we're
interpreting JSON data in a way that's consistent with *Postgres*
conventions.  You'd have to interpret the data according to *some*
convention in order to do something non-trivial with it in any case,
and users usually want that.

It's really nice the way encoding is a strict implementation detail
within Postgres in general, in the sense that you know that if your
application code is concerned about encoding, you're probably thinking
about the problem incorrectly (at least once data has crossed the
database encoding border).  MySQL's lackadaisical attitudes here appear
to have been an enormous mistake.

-- 
Peter Geoghegan
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Peter Geoghegan <p...@heroku.com> writes:
> I looked into it, and it turns out that MongoDB does not accept NUL in
> at least some contexts (for object keys).  Apparently it wasn't always
> so.  MongoDB previously had a security issue that was fixed by
> introducing this restriction.  Their JSON-centric equivalent of
> per-column privileges was for a time compromised, because NUL
> injection was possible:
> https://www.idontplaydarts.com/2011/02/mongodb-null-byte-injection-attacks/
> It's easy to bash MongoDB, but this is still an interesting data
> point.  They changed this after the fact, and yet I can find no
> evidence of any grumbling about it from end users.  No one really
> noticed.

Hoo, that's interesting.  Lends some support to my half-baked idea that
we might disallow NUL in object keys even if we are able to allow it
elsewhere in JSON strings.

			regards, tom lane
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On 01/29/2015 05:39 PM, Tom Lane wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> I have yet to understand what we fix by banning \u0000.  How is it
>> different from any other four-digit hexadecimal number that's not a
>> valid character in the current encoding?  What does banning that one
>> particular value do?
> BTW, as to the point about encoding violations: we *already* ban \u
> sequences that don't correspond to valid characters in the current
> encoding.  The attempt to exclude U+0000 from the set of banned
> characters was ill-advised, plain and simple.

Actually, unless the encoding is utf8 we ban all non-ascii unicode
escapes even if they might designate a valid character in the current
encoding.  This was arrived at after some discussion here.  So adding
\u0000 to the list of banned characters is arguably just making us more
consistent.

cheers

andrew
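[Editorial note: the rule Andrew describes can be sketched as a tiny validator. `resolve_escape` and its error messages are hypothetical illustrations of the policy under discussion, not the actual PostgreSQL code.]

```python
def resolve_escape(codepoint: int, db_encoding: str) -> str:
    """Hypothetical mirror of the jsonb rule under discussion: ASCII
    escapes are always resolved, higher code points only when the
    database encoding is UTF8, and U+0000 is rejected outright."""
    if codepoint == 0:
        raise ValueError("\\u0000 cannot be stored in a text value")
    if codepoint <= 0x7F or db_encoding == "UTF8":
        return chr(codepoint)
    raise ValueError("non-ASCII escape in non-UTF8 database encoding")

print(resolve_escape(0x41, "LATIN1"))   # A - ASCII is always allowed
print(resolve_escape(0x05E2, "UTF8"))   # resolved in a UTF8 database
```

Under this sketch, `resolve_escape(0x05E2, "LATIN1")` raises even though U+05E2 might be representable in some non-UTF8 encoding, matching the deliberately conservative behavior Andrew describes; banning U+0000 just extends the same rule to one more code point.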
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Peter Geoghegan <p...@heroku.com> writes:
> On Thu, Jan 29, 2015 at 10:20 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> I made the \u0000 error be
>> errcode(ERRCODE_INVALID_TEXT_REPRESENTATION) and errmsg("invalid
>> input syntax for type json"), by analogy to what's thrown for
>> non-ASCII Unicode escapes in non-UTF8 encoding.  I'm not terribly
>> happy with that, though.  ISTM that for both cases, this is not
>> invalid syntax at all, but an implementation restriction that forces
>> us to reject perfectly valid syntax.  So I think we ought to use a
>> different ERRCODE and text message, though I'm not entirely sure what
>> it should be instead.  ERRCODE_FEATURE_NOT_SUPPORTED is one
>> possibility.
> I personally prefer what you have here.  The point of JSONB is that we
> take a position on certain aspects like this.  We're bridging a
> pointedly loosey goosey interchange format, JSON, with native
> PostgreSQL types.  For example, we take a firm position on encoding.
> The JSON type is a bit more permissive, to about the extent that
> that's possible.  The whole point is that we're interpreting JSON data
> in a way that's consistent with *Postgres* conventions.  You'd have to
> interpret the data according to *some* convention in order to do
> something non-trivial with it in any case, and users usually want
> that.

I quite agree with you, actually, in terms of that perspective.

But my point remains: \u0000 is not invalid JSON syntax, and neither is
\u1234.  If we choose to throw an error because we can't interpret or
process that according to our conventions, fine, but we should call it
something other than invalid syntax.  ERRCODE_UNTRANSLATABLE_CHARACTER
or ERRCODE_CHARACTER_NOT_IN_REPERTOIRE seem more apropos from here.
And I still think there's a case to be made for
ERRCODE_FEATURE_NOT_SUPPORTED, because it's at least possible that we'd
relax this restriction in future (eg, allow Unicode characters that can
be converted to the database's encoding).

			regards, tom lane
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Noah Misch <n...@leadboat.com> writes:
> On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
>> So at this point I propose that we reject \u0000 when de-escaping
>> JSON.
> I would have agreed on 2014-12-09, and this release is the last chance
> to make such a change.  It is a bold wager that could pay off, but -1
> from me anyway.

You only get to vote -1 if you have a credible alternative.  I don't
see one.

> I can already envision the blog post from the DBA staying on 9.4.0
> because 9.4.1 pulled his ability to store U+0000 in jsonb.

Those will be more or less the same people who bitch about text not
storing NULs; the world has not fallen.

> jsonb was *the* top-billed 9.4 feature, and this thread started with
> Andrew conveying a field report of a scenario more obscure than
> storing U+0000.

>> There is a separate issue, which is that our earlier attempt to make
>> the world safe for \u0000 actually broke other things.  We still need
>> to fix that, but I think the fix probably consists of reverting that
>> patch and instead disallowing \u0000.  Anybody who's seriously
>> unhappy with that can propose a patch to fix it properly in 9.5 or
>> later.
> Someone can still do that by introducing a V2 of the jsonb binary
> format and preserving the ability to read both formats.  (Too bad
> Andres's proposal to include a format version didn't inform the final
> format, but we can wing it.)  I agree that storing U+0000 as 0x00 is
> the best end state.

We will not need a v2 format, merely values that contain NULs.
Existing data containing \u0000 will be read as those six ASCII
characters, which is not really wrong since that might well be what it
was anyway.

>> We probably need to rethink the re-escaping behavior as well; I'm not
>> sure if your latest patch is the right answer for that.
> Yes, we do.
>
> No change to the representation of U+0000 is going to fix the
> following bug, but that patch does fix it:
>
> [local] test=# select
> test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb,
> test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb::text::jsonb;

The cause of this bug is commit
0ad1a816320a2b539a51628e2a0b1e83ff096b1d, which I'm inclined to think
we need to simply revert, not render even more squirrely.

			regards, tom lane
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On 01/28/2015 12:50 AM, Noah Misch wrote:
> On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
>> Andrew Dunstan <and...@dunslane.net> writes:
>>> On 01/27/2015 02:28 PM, Tom Lane wrote:
>>>> Well, we can either fix it now or suffer with a broken
>>>> representation forever.  I'm not wedded to the exact solution I
>>>> described, but I think we'll regret it if we don't change the
>>>> representation.
>> So at this point I propose that we reject \u0000 when de-escaping
>> JSON.
> I would have agreed on 2014-12-09, and this release is the last chance
> to make such a change.  It is a bold wager that could pay off, but -1
> from me anyway.  I can already envision the blog post from the DBA
> staying on 9.4.0 because 9.4.1 pulled his ability to store U+0000 in
> jsonb.  jsonb was *the* top-billed 9.4 feature, and this thread
> started with Andrew conveying a field report of a scenario more
> obscure than storing U+0000.  Therefore, we have to assume many users
> will notice the change.  This move would also add to the growing
> evidence that our .0 releases are really beta(N+1) releases in
> disguise.
>
>> Anybody who's seriously unhappy with that can propose a patch to fix
>> it properly in 9.5 or later.
> Someone can still do that by introducing a V2 of the jsonb binary
> format and preserving the ability to read both formats.  (Too bad
> Andres's proposal to include a format version didn't inform the final
> format, but we can wing it.)  I agree that storing U+0000 as 0x00 is
> the best end state.

We need to make up our minds about this pretty quickly.  The more
radical move is likely to involve quite a bit of work, ISTM.  It's not
clear to me how we should represent a unicode null.  i.e. given a json
of '["foo\u0000bar"]', I get that we'd store the element as
'foo\x00bar', but what is the result of

    (jsonb '["foo\u0000bar"]')->>0

It's defined to be text so we can't just shove a binary null in the
middle of it.  Do we throw an error?

And I still want to hear more voices on the whole direction we want to
take this.

cheers

andrew
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Andrew Dunstan and...@dunslane.net writes: It's not clear to me how we should represent a unicode null, i.e. given a json of '["foo\u0000bar"]', I get that we'd store the element as 'foo\x00bar', but what is the result of (jsonb '["foo\u0000bar"]')->>0 ? It's defined to be text, so we can't just shove a binary null in the middle of it. Do we throw an error?

Yes, that is what I was proposing upthread. Obviously, this needs some thought to ensure that there's *something* useful you can do with a field containing a nul, but we'd have little choice but to throw an error if the user asks us to convert such a field to unescaped text. I'd be a bit inclined to reject nuls in object field names even if we allow them in field values, since just about everything you can usefully do with a field name involves regarding it as text. Another interesting implementation problem is what does indexing do with such values --- ISTR there's an implicit conversion to C strings in there too, at least in GIN indexes.

Anyway, there is a significant amount of work involved here, and there's no way we're getting it done for 9.4.1, or probably 9.4.anything. I think our only realistic choice right now is to throw error for \u0000 so that we can preserve our options for doing something useful with it later.

regards, tom lane
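Tom's worry about the implicit conversion to C strings can be illustrated outside PostgreSQL. This is my own minimal sketch, not code from the thread: any layer that treats a NUL-bearing value as a NUL-terminated C string (as GIN key handling and text I/O do) silently loses everything past the first zero byte.

```python
import ctypes

# A value logically containing an embedded NUL...
value = b'foo\x00bar'
buf = ctypes.create_string_buffer(value)

# ...loses everything after the NUL as soon as it is read back as a
# NUL-terminated C string: string_at() scans only up to the first \x00.
assert ctypes.string_at(buf) == b'foo'
```

This is why "just store U+0000 as a real NUL byte" forces either an error on conversion to text or a data-loss hazard in every C-string path.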
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Jim Nasby jim.na...@bluetreble.com writes: While I sympathize with Noah's sentiments, the only thing that makes sense to me is that a JSON text field is treated the same way as we treat text. Right now, that means NUL is not allowed, period. If no one has bitched about this with text, is it really that big a problem with JSON? Oh, people have bitched about it all right ;-). But I think your real question is how many applications find that restriction to be fatal. The answer clearly is not a lot for text, and I don't see why the answer would be different for JSON. regards, tom lane
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On 1/28/15 11:36 AM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: It's not clear to me how we should represent a unicode null, i.e. given a json of '["foo\u0000bar"]', I get that we'd store the element as 'foo\x00bar', but what is the result of (jsonb '["foo\u0000bar"]')->>0 ? It's defined to be text, so we can't just shove a binary null in the middle of it. Do we throw an error? Yes, that is what I was proposing upthread. Obviously, this needs some thought to ensure that there's *something* useful you can do with a field containing a nul, but we'd have little choice but to throw an error if the user asks us to convert such a field to unescaped text. I'd be a bit inclined to reject nuls in object field names even if we allow them in field values, since just about everything you can usefully do with a field name involves regarding it as text. Another interesting implementation problem is what does indexing do with such values --- ISTR there's an implicit conversion to C strings in there too, at least in GIN indexes. Anyway, there is a significant amount of work involved here, and there's no way we're getting it done for 9.4.1, or probably 9.4.anything. I think our only realistic choice right now is to throw error for \u0000 so that we can preserve our options for doing something useful with it later.

My $0.01: While I sympathize with Noah's sentiments, the only thing that makes sense to me is that a JSON text field is treated the same way as we treat text. Right now, that means NUL is not allowed, period. If no one has bitched about this with text, is it really that big a problem with JSON?

-- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On Wed, Jan 28, 2015 at 1:13 PM, Jim Nasby jim.na...@bluetreble.com wrote: My $0.01: While I sympathize with Noah's sentiments, the only thing that makes sense to me is that a JSON text field is treated the same way as we treat text. Right now, that means NUL is not allowed, period. If no one has bitched about this with text, is it really that big a problem with JSON? I would bitch about it for text if I thought that it would do any good.
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
David G Johnston david.g.johns...@gmail.com writes: Am I missing something or has there been no consideration in this forbid plan on whether users will be able to retrieve, even if partially incorrectly, any jsonb data that has already been stored?

Data that's already been stored would look like the six characters \u0000, which indeed might have been what it was anyway, since part of the problem here is that we fail to distinguish \\u0000 from \u0000.

regards, tom lane
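The indistinguishability Tom describes can be reproduced with a toy de-escaper. The sketch below is my own illustration (not PostgreSQL code): it resolves every JSON string escape except \u0000, mimicking the pre-revert jsonb rule, and shows that two distinct inputs then collide, whereas a full unescape keeps them apart.

```python
import json

def jsonb_style_unescape(s):
    """Toy de-escaper mimicking (as an illustration, not the real code) the
    pre-revert jsonb rule: resolve every JSON escape EXCEPT \\u0000, which
    is kept as the literal six characters."""
    simple = {'"': '"', '\\': '\\', '/': '/', 'b': '\b',
              'f': '\f', 'n': '\n', 'r': '\r', 't': '\t'}
    out, i = [], 0
    while i < len(s):
        if s[i] == '\\' and s[i + 1] == 'u':
            hex4 = s[i + 2:i + 6]
            # The exception that creates the ambiguity:
            out.append('\\u0000' if hex4 == '0000' else chr(int(hex4, 16)))
            i += 6
        elif s[i] == '\\':
            out.append(simple[s[i + 1]])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)

# The distinct JSON source sequences \u0000 and \\u0000 collide:
assert jsonb_style_unescape(r'\u0000') == jsonb_style_unescape(r'\\u0000') == r'\u0000'
# ...whereas a full unescape (Python's json module) keeps them distinct:
assert json.loads('"\\u0000"') != json.loads('"\\\\u0000"')
```

Once both inputs map to the same stored bytes, no output rule can recover which one the user originally wrote — exactly the round-trip failure the thread is debating.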
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Tom Lane-2 wrote: Andrew Dunstan <andrew@> writes: On 01/27/2015 02:28 PM, Tom Lane wrote: Well, we can either fix it now or suffer with a broken representation forever. I'm not wedded to the exact solution I described, but I think we'll regret it if we don't change the representation. The only other plausible answer seems to be to flat out reject \u0000. But I assume nobody likes that. I don't think we can be in the business of rejecting valid JSON.

Actually, after studying the code a bit, I wonder if we wouldn't be best off to do exactly that, at least for 9.4.x. At minimum we're talking about an API change for JsonSemAction functions (which currently get the already-de-escaped string as a C string; not gonna work for embedded nulls). I'm not sure if there are already third-party extensions using that API, but it seems possible, in which case changing it in a minor release wouldn't be nice. Even ignoring that risk, making sure we'd fixed everything seems like more than a day's work, which is as much as I for one could spare before 9.4.1. So at this point I propose that we reject \u0000 when de-escaping JSON. Anybody who's seriously unhappy with that can propose a patch to fix it properly in 9.5 or later.

The hybrid option is to reject the values for 9.4.1 but then commit to removing that hack and fixing this properly in 9.4.2; we can always call that release 9.5...

I think the "it would mean rejecting valid JSON" argument is utter hogwash. We already reject, eg, \u00A0 if you're not using a UTF8 encoding. And we reject 1e10000, not because that's invalid JSON but because of an implementation restriction of our underlying numeric type. I don't see any moral superiority of that over rejecting \u0000 because of an implementation restriction of our underlying text type.

Am I missing something or has there been no consideration in this forbid plan on whether users will be able to retrieve, even if partially incorrectly, any jsonb data that has already been stored? If we mangled their data on input we should at least return the data and provide them a chance to manually (or automatically, depending on their data) fix our mistake.

Given we already disallow NUL in text, ISTM that allowing said data in other text-like areas is asking for just the kind of trouble we are seeing here. I'm OK with the proposition that those wishing to utilize NUL are relegated to working with bytea.

From the commit Tom references down-thread: "However, this led to some perverse results in the case of Unicode sequences." Given that said results are not documented in the commit, it's hard to judge whether a complete revert is being overly broad...

Anyway, just some observations since I'm not currently a user of JSON. Tom's arguments and counter-arguments ring true to me in the general sense. The DBA staying on 9.4.0 because of this change probably just needs to be told to go back to using json and then run the update. Their data has issues even if they stay on 9.4.0 with the more accepting version of jsonb.

David J.
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Andrew Dunstan and...@dunslane.net writes: On 01/27/2015 01:40 PM, Tom Lane wrote: In particular, I would like to suggest that the current representation of \u0000 is fundamentally broken and that we have to change it, not try to band-aid around it. This will mean an on-disk incompatibility for jsonb data containing U+0000, but hopefully there is very little of that out there yet. If we can get a fix into 9.4.1, I think it's reasonable to consider such solutions. Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on the table what I suggested is unnecessary.

Well, we can either fix it now or suffer with a broken representation forever. I'm not wedded to the exact solution I described, but I think we'll regret it if we don't change the representation. The only other plausible answer seems to be to flat out reject \u0000. But I assume nobody likes that.

regards, tom lane
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On 01/27/2015 02:28 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 01/27/2015 01:40 PM, Tom Lane wrote: In particular, I would like to suggest that the current representation of \u0000 is fundamentally broken and that we have to change it, not try to band-aid around it. This will mean an on-disk incompatibility for jsonb data containing U+0000, but hopefully there is very little of that out there yet. If we can get a fix into 9.4.1, I think it's reasonable to consider such solutions. Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on the table what I suggested is unnecessary. Well, we can either fix it now or suffer with a broken representation forever. I'm not wedded to the exact solution I described, but I think we'll regret it if we don't change the representation. The only other plausible answer seems to be to flat out reject \u0000. But I assume nobody likes that.

I don't think we can be in the business of rejecting valid JSON.

cheers andrew
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Andrew Dunstan and...@dunslane.net writes: On 01/27/2015 02:28 PM, Tom Lane wrote: Well, we can either fix it now or suffer with a broken representation forever. I'm not wedded to the exact solution I described, but I think we'll regret it if we don't change the representation. The only other plausible answer seems to be to flat out reject \u0000. But I assume nobody likes that. I don't think we can be in the business of rejecting valid JSON.

Actually, after studying the code a bit, I wonder if we wouldn't be best off to do exactly that, at least for 9.4.x. At minimum we're talking about an API change for JsonSemAction functions (which currently get the already-de-escaped string as a C string; not gonna work for embedded nulls). I'm not sure if there are already third-party extensions using that API, but it seems possible, in which case changing it in a minor release wouldn't be nice. Even ignoring that risk, making sure we'd fixed everything seems like more than a day's work, which is as much as I for one could spare before 9.4.1.

Also, while the idea of throwing error only when a \0 needs to be converted to text seems logically clean, it looks like that might pretty much cripple the usability of such values anyhow, because we convert to text at the drop of a hat. So some investigation and probably additional work would be needed to ensure you could do at least basic things with such values. (A function for direct conversion to bytea might be useful too.)

I think the "it would mean rejecting valid JSON" argument is utter hogwash. We already reject, eg, \u00A0 if you're not using a UTF8 encoding. And we reject 1e10000, not because that's invalid JSON but because of an implementation restriction of our underlying numeric type. I don't see any moral superiority of that over rejecting \u0000 because of an implementation restriction of our underlying text type.

So at this point I propose that we reject \u0000 when de-escaping JSON. Anybody who's seriously unhappy with that can propose a patch to fix it properly in 9.5 or later.

We probably need to rethink the re-escaping behavior as well; I'm not sure if your latest patch is the right answer for that.

regards, tom lane
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Merlin Moncure mmonc...@gmail.com writes: On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: In particular, I would like to suggest that the current representation of \u0000 is fundamentally broken and that we have to change it, not try to band-aid around it. This will mean an on-disk incompatibility for jsonb data containing U+0000, but hopefully there is very little of that out there yet. If we can get a fix into 9.4.1, I think it's reasonable to consider such solutions. The most obvious way to store such data unambiguously is to just go ahead and store U+0000 as a NUL byte (\000). The only problem with that is that then such a string cannot be considered to be a valid value of type TEXT, which would mean that we'd need to throw an error if we were asked to convert a JSON field containing such a character to text. Hm, does this include text out operations for display purposes? If so, then any query selecting jsonb objects with null bytes would fail. How come we have to error out? How about a warning indicating the string was truncated?

No, that's not a problem, because jsonb_out would produce an escaped textual representation, so any null would come out as \u0000 again. The trouble comes up when you do something that's supposed to produce a *non escaped* text equivalent of a JSON string value, such as the ->> operator. Arguably, ->> is broken already with the current coding, in that these results are entirely inconsistent:

regression=# select '{"a": "foo\u0040bar"}'::jsonb ->> 'a';
 ?column?
----------
 foo@bar
(1 row)

regression=# select '{"a": "foo\u0000bar"}'::jsonb ->> 'a';
   ?column?
--------------
 foo\u0000bar
(1 row)

regression=# select '{"a": "foo\\u0000bar"}'::jsonb ->> 'a';
   ?column?
--------------
 foo\u0000bar
(1 row)

regards, tom lane
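For contrast with the inconsistent ->> results above, a JSON implementation that fully de-escapes strings keeps all three inputs distinct. Python's json module serves here as a neutral reference point outside PostgreSQL (my illustration, not code from the thread):

```python
import json

# \u0040 resolves to '@', \u0000 to a real NUL character, and \\u0000 to a
# literal backslash followed by the five characters u0000 -- three distinct
# values, where jsonb's ->> collapses the last two into the same output.
assert json.loads('{"a": "foo\\u0040bar"}')["a"] == 'foo@bar'
assert json.loads('{"a": "foo\\u0000bar"}')["a"] == 'foo\x00bar'
assert json.loads('{"a": "foo\\\\u0000bar"}')["a"] == 'foo\\u0000bar'
```

The cost of that consistency is exactly what the thread is wrestling with: the second value contains a real NUL, which PostgreSQL's text type cannot hold.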
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Andrew Dunstan and...@dunslane.net writes: On 01/27/2015 12:23 PM, Tom Lane wrote: I think coding anything is premature until we decide how we're going to deal with the fundamental ambiguity. The input "\\uabcd" will be stored correctly as \uabcd, but this will in turn be rendered as "\uabcd", whereas it should be rendered as "\\uabcd". That's what the patch fixes. There are two problems here and this addresses one of them. The other problem is the ambiguity regarding \\u0000 and \u0000.

It's the same problem really, and until we have an answer about what to do with \u0000, I think any patch is half-baked and possibly counterproductive. In particular, I would like to suggest that the current representation of \u0000 is fundamentally broken and that we have to change it, not try to band-aid around it. This will mean an on-disk incompatibility for jsonb data containing U+0000, but hopefully there is very little of that out there yet. If we can get a fix into 9.4.1, I think it's reasonable to consider such solutions.

The most obvious way to store such data unambiguously is to just go ahead and store U+0000 as a NUL byte (\000). The only problem with that is that then such a string cannot be considered to be a valid value of type TEXT, which would mean that we'd need to throw an error if we were asked to convert a JSON field containing such a character to text. I don't particularly have a problem with that, except possibly for the time cost of checking for \000 before allowing a conversion to occur. While a memchr() check might be cheap enough, we could also consider inventing a new JEntry type code for string-containing-null, so that there's a distinction in the type system between strings that are coercible to text and those that are not. If we went down a path like that, the currently proposed patch would be quite useless.

regards, tom lane
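Tom's proposal — store U+0000 as a real NUL but refuse conversion to text — amounts to a cheap memchr()-style scan before every text conversion. A sketch of that rule follows; the function name and error wording are my assumptions, not committed behavior:

```python
def jsonb_string_to_text(raw: str) -> str:
    """Sketch of the proposed rule: a stored jsonb string may contain
    U+0000, but converting it to the text type must fail."""
    # memchr()-style check before the conversion is allowed to occur
    if '\x00' in raw:
        raise ValueError(r'cannot convert jsonb string containing \u0000 to text')
    return raw

# Ordinary strings pass through untouched...
assert jsonb_string_to_text('foo@bar') == 'foo@bar'

# ...while a NUL-bearing value errors out instead of being truncated.
try:
    jsonb_string_to_text('foo\x00bar')
    raised = False
except ValueError:
    raised = True
assert raised
```

Tom's alternative — a dedicated JEntry type code for string-containing-null — would move this check from scan time to type-dispatch time, at the cost of a format change.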
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On 01/27/2015 01:40 PM, Tom Lane wrote: In particular, I would like to suggest that the current representation of \u0000 is fundamentally broken and that we have to change it, not try to band-aid around it. This will mean an on-disk incompatibility for jsonb data containing U+0000, but hopefully there is very little of that out there yet. If we can get a fix into 9.4.1, I think it's reasonable to consider such solutions.

Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on the table what I suggested is unnecessary.

cheers andrew
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: Andrew Dunstan and...@dunslane.net writes: On 01/27/2015 12:23 PM, Tom Lane wrote: I think coding anything is premature until we decide how we're going to deal with the fundamental ambiguity. The input "\\uabcd" will be stored correctly as \uabcd, but this will in turn be rendered as "\uabcd", whereas it should be rendered as "\\uabcd". That's what the patch fixes. There are two problems here and this addresses one of them. The other problem is the ambiguity regarding \\u0000 and \u0000. It's the same problem really, and until we have an answer about what to do with \u0000, I think any patch is half-baked and possibly counterproductive. In particular, I would like to suggest that the current representation of \u0000 is fundamentally broken and that we have to change it, not try to band-aid around it. This will mean an on-disk incompatibility for jsonb data containing U+0000, but hopefully there is very little of that out there yet. If we can get a fix into 9.4.1, I think it's reasonable to consider such solutions. The most obvious way to store such data unambiguously is to just go ahead and store U+0000 as a NUL byte (\000). The only problem with that is that then such a string cannot be considered to be a valid value of type TEXT, which would mean that we'd need to throw an error if we were asked to convert a JSON field containing such a character to text.

Hm, does this include text out operations for display purposes? If so, then any query selecting jsonb objects with null bytes would fail. How come we have to error out? How about a warning indicating the string was truncated?

merlin
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 01/27/2015 02:28 PM, Tom Lane wrote: Well, we can either fix it now or suffer with a broken representation forever. I'm not wedded to the exact solution I described, but I think we'll regret it if we don't change the representation. So at this point I propose that we reject \u0000 when de-escaping JSON.

I would have agreed on 2014-12-09, and this release is the last chance to make such a change. It is a bold wager that could pay off, but -1 from me anyway. I can already envision the blog post from the DBA staying on 9.4.0 because 9.4.1 pulled his ability to store U+0000 in jsonb. jsonb was *the* top-billed 9.4 feature, and this thread started with Andrew conveying a field report of a scenario more obscure than storing U+0000. Therefore, we have to assume many users will notice the change. This move would also add to the growing evidence that our .0 releases are really beta(N+1) releases in disguise.

Anybody who's seriously unhappy with that can propose a patch to fix it properly in 9.5 or later.

Someone can still do that by introducing a V2 of the jsonb binary format and preserving the ability to read both formats. (Too bad Andres's proposal to include a format version didn't inform the final format, but we can wing it.) I agree that storing U+0000 as 0x00 is the best end state.

We probably need to rethink the re-escaping behavior as well; I'm not sure if your latest patch is the right answer for that.

Yes, we do. No change to the representation of U+0000 is going to fix the following bug, but that patch does fix it:

[local] test=# select
test-#   $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb,
test-#   $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb::text::jsonb;
 ?column? | ?column?
----------+----------
 t        | f
(1 row)
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On 01/27/2015 12:24 AM, Noah Misch wrote: For the moment, maybe I could commit the fix for the non-U+0000 case for escape_json, and we could think some more about detecting and warning about the problem strings. +1 for splitting development that way. Fixing the use of escape_json() is objective, but the choices around the warning are more subtle.

OK, so something like this patch? I'm mildly concerned that you and I are the only ones commenting on this.

cheers andrew

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 3c137ea..8d2b38f 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -94,6 +94,8 @@ static void datum_to_json(Datum val, bool is_null, StringInfo result,
 static void add_json(Datum val, bool is_null, StringInfo result,
 		 Oid val_type, bool key_scalar);
 static text *catenate_stringinfo_string(StringInfo buffer, const char *addon);
+static void escape_json_work(StringInfo buf, const char *str,
+				 bool jsonb_unicode);
 
 /* the null action object used for pure validation */
 static JsonSemAction nullSemAction =
@@ -2356,6 +2358,18 @@ json_object_two_arg(PG_FUNCTION_ARGS)
 void
 escape_json(StringInfo buf, const char *str)
 {
+	escape_json_work(buf, str, false);
+}
+
+void
+escape_jsonb(StringInfo buf, const char *str)
+{
+	escape_json_work(buf, str, true);
+}
+
+static void
+escape_json_work(StringInfo buf, const char *str, bool jsonb_unicode)
+{
 	const char *p;
 
 	appendStringInfoCharMacro(buf, '\"');
@@ -2398,7 +2412,9 @@ escape_json(StringInfo buf, const char *str)
 				 * unicode escape that should be present is \u0000, all the
 				 * other unicode escapes will have been resolved.
 				 */
-				if (p[1] == 'u' &&
+				if (jsonb_unicode && strncmp(p + 1, "u0000", 5) == 0)
+					appendStringInfoCharMacro(buf, *p);
+				else if (!jsonb_unicode && p[1] == 'u' &&
 					isxdigit((unsigned char) p[2]) &&
 					isxdigit((unsigned char) p[3]) &&
 					isxdigit((unsigned char) p[4]) &&
diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index 644ea6d..22e1263 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -305,7 +305,7 @@ jsonb_put_escaped_value(StringInfo out, JsonbValue *scalarVal)
 			appendBinaryStringInfo(out, "null", 4);
 			break;
 		case jbvString:
-			escape_json(out, pnstrdup(scalarVal->val.string.val, scalarVal->val.string.len));
+			escape_jsonb(out, pnstrdup(scalarVal->val.string.val, scalarVal->val.string.len));
 			break;
 		case jbvNumeric:
 			appendStringInfoString(out,
diff --git a/src/include/utils/json.h b/src/include/utils/json.h
index 6f69403..661d7bd 100644
--- a/src/include/utils/json.h
+++ b/src/include/utils/json.h
@@ -42,7 +42,13 @@ extern Datum json_build_array_noargs(PG_FUNCTION_ARGS);
 extern Datum json_object(PG_FUNCTION_ARGS);
 extern Datum json_object_two_arg(PG_FUNCTION_ARGS);
 
+/*
+ * escape_jsonb is more strict about unicode escapes, and will only not
+ * escape \u0000, as that is the only unicode escape that should still be
+ * present.
+ */
 extern void escape_json(StringInfo buf, const char *str);
+extern void escape_jsonb(StringInfo buf, const char *str);
 
 extern Datum json_typeof(PG_FUNCTION_ARGS);
diff --git a/src/test/regress/expected/jsonb.out b/src/test/regress/expected/jsonb.out
index aa5686f..b5457c4 100644
--- a/src/test/regress/expected/jsonb.out
+++ b/src/test/regress/expected/jsonb.out
@@ -324,11 +324,20 @@ select to_jsonb(timestamptz '2014-05-28 12:22:35.614298-04');
 (1 row)
 
 COMMIT;
--- unicode escape - backslash is not escaped
+-- unicode null - backslash not escaped
+-- note: using to_jsonb here bypasses the jsonb input routine.
+select jsonb '"\u0000"', to_jsonb(text '\u0000');
+  jsonb   | to_jsonb
+----------+----------
+ "\u0000" | "\u0000"
+(1 row)
+
+-- any other unicode-looking escape - backslash is escaped
+-- all unicode characters should have been resolved
 select to_jsonb(text '\uabcd');
- to_jsonb
-----------
- "\uabcd"
+ to_jsonb
+-----------
+ "\\uabcd"
 (1 row)
 
 -- any other backslash is escaped
@@ -338,6 +347,14 @@ select to_jsonb(text '\abcd');
  "\\abcd"
 (1 row)
 
+-- doubled backslash should be reproduced
+-- this isn't a unicode escape, it's an escaped backslash followed by 'u'
+select jsonb '"\\uabcd"';
+   jsonb
+-----------
+ "\\uabcd"
+(1 row)
+
 --jsonb_agg
 CREATE TEMP TABLE rows AS
 SELECT x, 'txt' || x as y
diff --git a/src/test/regress/expected/jsonb_1.out b/src/test/regress/expected/jsonb_1.out
index 687ae63..10aaac8 100644
--- a/src/test/regress/expected/jsonb_1.out
+++ b/src/test/regress/expected/jsonb_1.out
@@ -324,11 +324,20 @@ select to_jsonb(timestamptz '2014-05-28 12:22:35.614298-04');
 (1 row)
 
 COMMIT;
--- unicode escape - backslash is not escaped
+-- unicode null - backslash not escaped
+-- note: using to_jsonb here bypasses the jsonb input routine.
+select jsonb '"\u0000"', to_jsonb(text '\u0000');
+  jsonb   | to_jsonb
+----------+----------
+ "\u0000" | "\u0000"
+(1 row)
+
+-- any
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
Andrew Dunstan and...@dunslane.net writes: On 01/27/2015 12:24 AM, Noah Misch wrote: +1 for splitting development that way. Fixing the use of escape_json() is objective, but the choices around the warning are more subtle. OK, so something like this patch? I'm mildly concerned that you and I are the only ones commenting on this. Doesn't seem to me like this fixes anything. If the content of a jsonb value is correct, the output will be the same with or without this patch; and if it's not, this isn't going to really improve matters. I think coding anything is premature until we decide how we're going to deal with the fundamental ambiguity. regards, tom lane
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On 01/27/2015 12:23 PM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: On 01/27/2015 12:24 AM, Noah Misch wrote: +1 for splitting development that way. Fixing the use of escape_json() is objective, but the choices around the warning are more subtle. OK, so something like this patch? I'm mildly concerned that you and I are the only ones commenting on this. Doesn't seem to me like this fixes anything. If the content of a jsonb value is correct, the output will be the same with or without this patch; and if it's not, this isn't going to really improve matters. I think coding anything is premature until we decide how we're going to deal with the fundamental ambiguity.

The input "\\uabcd" will be stored correctly as \uabcd, but this will in turn be rendered as "\uabcd", whereas it should be rendered as "\\uabcd". That's what the patch fixes. There are two problems here and this addresses one of them. The other problem is the ambiguity regarding \\u0000 and \u0000.

cheers andrew
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On 01/23/2015 02:18 AM, Noah Misch wrote: On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote: The following case has just been brought to my attention (look at the differing number of backslashes): andrew=# select jsonb '\\u'; jsonb -- \u (1 row) andrew=# select jsonb '\u'; jsonb -- \u (1 row) A mess indeed. The input is unambiguous, but the jsonb representation can't distinguish \u from \\u. Some operations on the original json type have similar problems, since they use an in-memory binary representation with the same shortcoming: [local] test=# select json_array_element_text($$[\u]$$, 0) = test-# json_array_element_text($$[\\u]$$, 0); ?column? -- t (1 row) Things get worse, though. On output, '\uabcd' for any four hex digits is recognized as a unicode escape, and thus the backslash is not escaped, so that we get: andrew=# select jsonb '\\uabcd'; jsonb -- \uabcd (1 row) We could probably fix this fairly easily for non- U+ cases by having jsonb_to_cstring use a different escape_json routine. Sounds reasonable. For 9.4.1, before more people upgrade? But it's a mess, sadly, and I'm not sure what a good fix for the U+ case would look like. Agreed. When a string unescape algorithm removes some kinds of backslash escapes and not others, it's nigh inevitable that two semantically-distinct inputs can yield the same output. json_lex_string() fell into that trap by making an exception for \u. To fix this, the result needs to be fully unescaped (\u converted to the NUL byte) or retain all backslash escapes. (Changing that either way is no fun now that an on-disk format is at stake.) Maybe we should detect such input and emit a warning of ambiguity? It's likely to be rare enough, but clearly not as rare as we'd like, since this is a report from the field. Perhaps. Something like WARNING: jsonb cannot represent \\u; reading as \u? Alas, but I do prefer that to silent data corruption. Maybe something like this patch. I have two concerns about it, though. 
The first is the possible performance impact of looking for the offending string in every jsonb input, and the second is that the test isn't quite right, since input of '\\\\u0000' doesn't raise this issue - i.e. the problem arises when the \u0000 is preceded by an even number of backslashes.

For the moment, maybe I could commit the fix for the non-U+0000 case for escape_json, and we could think some more about detecting and warning about the problem strings.

cheers

andrew

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 3c137ea..8d2b38f 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -94,6 +94,8 @@ static void datum_to_json(Datum val, bool is_null, StringInfo result,
 static void add_json(Datum val, bool is_null, StringInfo result,
 		 Oid val_type, bool key_scalar);
 static text *catenate_stringinfo_string(StringInfo buffer, const char *addon);
+static void escape_json_work(StringInfo buf, const char *str,
+		 bool jsonb_unicode);
 
 /* the null action object used for pure validation */
 static JsonSemAction nullSemAction =
@@ -2356,6 +2358,18 @@ json_object_two_arg(PG_FUNCTION_ARGS)
 void
 escape_json(StringInfo buf, const char *str)
 {
+	escape_json_work( buf, str, false);
+}
+
+void
+escape_jsonb(StringInfo buf, const char *str)
+{
+	escape_json_work( buf, str, true);
+}
+
+static void
+escape_json_work(StringInfo buf, const char *str, bool jsonb_unicode)
+{
 	const char *p;
 
 	appendStringInfoCharMacro(buf, '\"');
@@ -2398,7 +2412,9 @@ escape_json(StringInfo buf, const char *str)
 				 * unicode escape that should be present is \u0000, all the
 				 * other unicode escapes will have been resolved.
 				 */
-				if (p[1] == 'u' &&
+				if (jsonb_unicode && strncmp(p+1, "u0000", 5) == 0)
+					appendStringInfoCharMacro(buf, *p);
+				else if (!jsonb_unicode && p[1] == 'u' &&
 					isxdigit((unsigned char) p[2]) &&
 					isxdigit((unsigned char) p[3]) &&
 					isxdigit((unsigned char) p[4]) &&
diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index 644ea6d..ba1e7f0 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -218,6 +218,14 @@ jsonb_from_cstring(char *json, int len)
 	JsonbInState state;
 	JsonSemAction sem;
 
+	if (strstr(json, "\\u0000") != NULL)
+		ereport(WARNING,
+				(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
+				 errmsg("jsonb cannot represent \\u0000; reading as "
+						"\\\\u0000"),
+				 errdetail("Due to an implementation restriction, jsonb cannot represent a unicode null immediately preceded by a backslash.")));
+	elog(WARNING,);
+
 	memset(&state, 0, sizeof(state));
 	memset(&sem, 0, sizeof(sem));
 	lex = makeJsonLexContextCstringLen(json, len, true);
@@ -305,7 +313,7 @@ jsonb_put_escaped_value(StringInfo out, JsonbValue *scalarVal)
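The even-number-of-backslashes condition described above suggests a sharper detection test than a plain strstr(). As a sketch (a hypothetical Python model of the rule, not the proposed C code): the six characters \u0000 form a genuine NUL escape only when the backslash introducing them is preceded by an even number, possibly zero, of further backslashes; an odd number means that backslash is itself escaped text.

```python
import re

# Sketch of the detection rule: match \u0000 only when its leading
# backslash is not itself escaped, i.e. when it is preceded by an
# even number (possibly zero) of additional backslashes.
GENUINE_NUL_ESCAPE = re.compile(r'(?<!\\)(?:\\\\)*\\u0000')

def input_is_ambiguous(json_text: str) -> bool:
    """Return True when the raw input contains a real \\u0000 escape,
    the case the proposed WARNING is meant to flag."""
    return GENUINE_NUL_ESCAPE.search(json_text) is not None
```

With this rule, '["\u0000"]' and '["\\\u0000"]' are flagged, while '["\\u0000"]' (an escaped backslash followed by the literal text u0000) no longer produces a false positive, addressing the cry-wolf concern.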
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On Mon, Jan 26, 2015 at 09:20:54AM -0500, Andrew Dunstan wrote:
On 01/23/2015 02:18 AM, Noah Misch wrote:
On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote:

We could probably fix this fairly easily for the non-U+0000 cases by having jsonb_to_cstring use a different escape_json routine.

Maybe we should detect such input and emit a warning of ambiguity? It's likely to be rare enough, but clearly not as rare as we'd like, since this is a report from the field.

Perhaps. Something like WARNING: jsonb cannot represent \u0000; reading as \\u0000? Alas, but I do prefer that to silent data corruption.

Maybe something like this patch. I have two concerns about it, though. The first is the possible performance impact of looking for the offending string in every jsonb input, and the second is that the test isn't quite right, since input of '\\\\u0000' doesn't raise this issue - i.e. the problem arises when the \u0000 is preceded by an even number of backslashes.

I share your second concern. It is important that this warning not cry wolf; it should never fire for an input that is actually unaffected.

For the moment, maybe I could commit the fix for the non-U+0000 case for escape_json, and we could think some more about detecting and warning about the problem strings.

+1 for splitting development that way. Fixing the use of escape_json() is objective, but the choices around the warning are more subtle.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
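The "different escape_json routine" fix endorsed above can be sketched outside C as well. The following is a loose Python model (not PostgreSQL's actual code) of the proposed jsonb output rule: since every unicode escape except \u0000 was resolved at parse time, \u0000 is the only sequence whose backslash should pass through unescaped, and every other backslash gets escaped.

```python
def escape_jsonb(s: str) -> str:
    """Loose sketch of the proposed jsonb output rule: a backslash in
    the stored text passes through unescaped only when it begins the
    literal sequence \\u0000 (the one irreducible unicode escape);
    all other backslashes and double quotes are escaped."""
    out = []
    i = 0
    while i < len(s):
        c = s[i]
        if c == '\\' and s[i + 1:i + 6] == 'u0000':
            out.append(c)          # keep \u0000 as a unicode escape
        elif c in ('\\', '"'):
            out.append('\\' + c)   # escape backslash and double quote
        else:
            out.append(c)
        i += 1
    return '"' + ''.join(out) + '"'
```

Under this rule the stored literal \uabcd round-trips with its backslash escaped ("\\uabcd"), while \u0000 is still emitted as "\u0000", which matches the strncmp(p+1, "u0000", 5) branch in the patch.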
Re: [HACKERS] jsonb, unicode escapes and escaped backslashes
On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote:

The following case has just been brought to my attention (look at the differing number of backslashes):

andrew=# select jsonb '"\\u0000"';
  jsonb
-----------
 "\u0000"
(1 row)

andrew=# select jsonb '"\u0000"';
  jsonb
-----------
 "\u0000"
(1 row)

A mess indeed. The input is unambiguous, but the jsonb representation can't distinguish \u0000 from \\u0000. Some operations on the original json type have similar problems, since they use an in-memory binary representation with the same shortcoming:

[local] test=# select json_array_element_text($$["\u0000"]$$, 0) =
test-# json_array_element_text($$["\\u0000"]$$, 0);
 ?column?
----------
 t
(1 row)

Things get worse, though. On output, '\uabcd' for any four hex digits is recognized as a unicode escape, and thus the backslash is not escaped, so that we get:

andrew=# select jsonb '"\\uabcd"';
  jsonb
-----------
 "\uabcd"
(1 row)

We could probably fix this fairly easily for the non-U+0000 cases by having jsonb_to_cstring use a different escape_json routine.

Sounds reasonable. For 9.4.1, before more people upgrade?

But it's a mess, sadly, and I'm not sure what a good fix for the U+0000 case would look like.

Agreed. When a string unescape algorithm removes some kinds of backslash escapes and not others, it's nigh inevitable that two semantically-distinct inputs can yield the same output. json_lex_string() fell into that trap by making an exception for \u0000. To fix this, the result needs to be fully unescaped (\u0000 converted to the NUL byte) or retain all backslash escapes. (Changing that either way is no fun now that an on-disk format is at stake.)

Maybe we should detect such input and emit a warning of ambiguity? It's likely to be rare enough, but clearly not as rare as we'd like, since this is a report from the field.

Perhaps. Something like WARNING: jsonb cannot represent \u0000; reading as \\u0000? Alas, but I do prefer that to silent data corruption.
Thanks,
nm
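The partial-unescaping trap described in this message can be made concrete with a small sketch. This is a simplified Python model of the behavior, not the actual json_lex_string(): it resolves backslash escapes except \u0000, which is kept as its literal six characters because the stored string cannot contain a NUL byte.

```python
def jsonb_style_unescape(s: str) -> str:
    """Simplified model of the trap: resolve every backslash escape
    except \\u0000, which survives as literal text. Assumes
    well-formed input; single-character escapes are simplified to
    dropping the backslash (\\\\ -> \\, \\" -> ")."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == '\\':
            if s[i + 1] == 'u':
                code = int(s[i + 2:i + 6], 16)
                # the exception: \u0000 is kept as literal text
                out.append('\\u0000' if code == 0 else chr(code))
                i += 6
            else:
                out.append(s[i + 1])  # drop the backslash
                i += 2
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)
```

The two semantically-distinct inputs \u0000 (a NUL escape) and \\u0000 (an escaped backslash followed by u0000) both collapse to the same six stored characters, which is exactly the ambiguity at issue, while other escapes such as \u0041 resolve normally.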