It would be more exact to say that JSON strings, like strings in JavaScript, Java, and many other programming languages, are just streams of 16-bit code units. The transport syntax of JSON does not even require the textual syntax itself to be encoded in UTF-16, and in most cases it will be transported as UTF-8. So before processing "application/json" content, you first have to determine the character encoding needed to decode this syntax (in HTTP, the Content-Type header may indicate the charset actually used, but the "application/json" media type defaults to UTF-8). The JSON processor will then decode this text and remap it to an internal UTF-16 encoding (for characters that are not escaped), and each "\uXXXX" escape will be decoded as a plain 16-bit code unit. The result is a stream of 16-bit code units, which can then be output and encoded or stored externally in any convenient encoding that preserves this stream, EVEN if it is not valid UTF-16.
If you need UTF-16 validation, this is not the job of JSON itself (or of Java, JavaScript, or similar) but depends on the application using the JSON data: some applications will reject the stream as invalid because they expect their input to be a valid UTF (not necessarily UTF-16 or UTF-8), and others may restrict the supported character set even further (e.g. to just ASCII, or to some other encoding such as the GSM encoding for SMS, or by keeping only the lowest 8 bits of each code unit).
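As a rough illustration of the decoding steps above, here is a minimal sketch in Python. Note the assumptions: Python strings are sequences of code points rather than 16-bit code units, so the helper code_units() is a hypothetical name of mine that exposes the UTF-16 view, and is_well_formed() stands in for whatever validation a given application chooses to apply. The sample document is invented for the example.

```python
import json
import struct
from typing import List


def code_units(s: str) -> List[int]:
    """Return the UTF-16 code units of s, keeping lone surrogates as they are."""
    raw = s.encode("utf-16-le", errors="surrogatepass")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


def is_well_formed(s: str) -> bool:
    """Application-level check (not part of JSON): can s be encoded as strict UTF-8?"""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# JSON text as it might arrive on the wire, encoded in UTF-8.
wire = b'{"ok": "caf\\u00e9 \\ud83d\\ude00", "odd": "\\udead"}'

# Step 1: decode the transport encoding. Step 2: parse the JSON syntax,
# which turns each \uXXXX escape into a code unit.
doc = json.loads(wire.decode("utf-8"))

print([hex(u) for u in code_units(doc["odd"])])  # ['0xdead']
print([hex(u) for u in code_units(doc["ok"])[-2:]])  # ['0xd83d', '0xde00']

# Step 3: validation is a separate decision made by the application.
print(is_well_formed(doc["ok"]), is_well_formed(doc["odd"]))  # True False
```

The parser accepts the lone surrogate without complaint; only the later, application-chosen check rejects it.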
JSON by itself is neutral: its syntax just assumes that any stream of 16-bit code units is encodable and transportable. It could even be used to transport executable binary code or bitmap image data (such as JPEG or PNG), provided that there is a way to represent the effective binary length (when it is not an exact multiple of 16 bits) with additional data transmitted in the JSON-encoded data. However, the most common way to store such binary data in JSON is Base64, for example as a "data:" URL, a scheme commonly used in CSS, which can be safely embedded in JSON strings.
I don't think this is a bad thing about JSON: JSON strings are NOT equivalent to text. Not all text is valid Unicode text either, when it uses specific encodings whose character entities don't have a one-to-one mapping in the UCS (for example, private-use characters that require an external agreement if we want to map them to PUA code points in the UCS, or characters that the encoding maps to non-characters of the UCS). There is an "assumed" encoding only for characters that are not reserved by the JSON syntax and not represented as escape sequences, and this assumption is itself based on an external agreement about the encoding used in the transport.

2015-05-07 22:29 GMT+02:00 Daniel Bünzli <[email protected]>:
> On Thursday, 7 May 2015 at 21:59, Markus Scherer wrote:
> > I assume that the JSON spec deliberately allows anything that Java and
> > JavaScript allow. In particular, there is no requirement for a Java String
> > or JavaScript string to contain "text", or well-formed UTF-16, or only
> > assigned characters.
> >
> > Some code stores binary data (sequence of arbitrary 16-bit unsigned
> > integers) in a "string", just because it is easy and fairly efficient to
> > transport.
> >
> > You should "validate" *text* only when you are certain that it is indeed
> > text.
>
> Section 8.2 [1] of the spec specifically says that only strings that
> represent sequences of Unicode scalar values (they say "characters") are
> interoperable and that strings that do not represent such sequences like
> "\uDEAD" can lead to unpredictable behaviour.
>
> If you want to transmit binary data reliably in json you must apply some
> form of binary to Unicode scalar value encoding (like in most text based
> interchange formats).
>
> Best,
>
> Daniel
>
> [1] https://tools.ietf.org/html/rfc7159#section-8.2
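
For completeness, the Base64 route mentioned in this thread (one possible binary to Unicode scalar value encoding) can be sketched as follows in Python. The function names pack() and unpack() and the "data" field name are hypothetical choices of mine for the example, not part of any spec.

```python
import base64
import json


def pack(payload: bytes) -> str:
    """Wrap arbitrary bytes in a JSON document using Base64 (ASCII only)."""
    encoded = base64.b64encode(payload).decode("ascii")
    return json.dumps({"data": encoded})


def unpack(text: str) -> bytes:
    """Recover the original bytes from a document produced by pack()."""
    obj = json.loads(text)
    return base64.b64decode(obj["data"])


# An odd number of bytes: this would not fit exactly into 16-bit code units,
# but Base64 preserves the length without any extra bookkeeping.
payload = bytes([0xDE, 0xAD, 0xBE, 0xEF, 0x00])
text = pack(payload)

print(text)  # {"data": "3q2+7wA="}
print(unpack(text) == payload)  # True
```

The resulting JSON string contains only ASCII characters, so it stays interoperable in the sense of Section 8.2, at the cost of roughly a third more data.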

