RE: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Costello, Roger L. Fri, 08 May 2015 02:32:08 -0700

Philippe Verdy wrote:


> implementations just support JSON as plain 16-bit streams

> Try by yourself, you can perfectly send JSON text containing
> '\uFFFF' (non-character) or '\uD800' (unpaired surrogate) and
> I've not seen any JSON implementation complaining about one
> or the other

Okay, I gave it a try. I created this string which contains binary data 
(sequence of arbitrary unsigned integers):

"
________________________________
æä}gõ› "

When I validated that string against this JSON Schema:

{
   "type" : "string"
}

using this online validator: https://json-schema-validator.herokuapp.com/

I got an error: Invalid JSON: parse error, line 1

I am pretty sure that Daniel is correct: JSON cannot contain arbitrary bit 
streams.
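
A rough way to reproduce this kind of failure locally is sketched below. It assumes Python's standard `json` module as a stand-in for the online validator (whose internals were not inspected), and the sample inputs are hypothetical, not the exact string above. It contrasts a raw control character, the same character written as a `\u` escape, and bytes that are not valid UTF-8.

```python
import json

# Hypothetical sample: a JSON string holding a raw, unescaped control
# character (U+0001). The JSON grammar requires such characters to be escaped.
try:
    json.loads('"a\x01b"')
    raw_control_ok = True
except json.JSONDecodeError:
    raw_control_ok = False

# The same content with the control character written as a \u escape parses.
escaped_ok = json.loads('"a\\u0001b"') == "a\x01b"

# Hypothetical bytes that are not valid UTF-8. In CPython this fails while
# decoding the input (UnicodeDecodeError, a ValueError subclass), before any
# JSON parsing happens.
try:
    json.loads(b'"\xe6\xe4}g\xf5"')
    bad_bytes_ok = True
except ValueError:
    bad_bytes_ok = False

print(raw_control_ok, escaped_ok, bad_bytes_ok)
```

So, at least with this parser, arbitrary bytes placed directly between the quotes are rejected, while the escaped form of a control character is accepted.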


> The RFC is just informative not normative

Interesting! What does that mean? JSON vendors are free to ignore the JSON RFC 
and do as they see fit?

/Roger

From: [email protected] [mailto:[email protected]] On Behalf Of Philippe Verdy
Sent: Thursday, May 07, 2015 11:08 PM
To: Daniel Bünzli
Cc: [email protected]; Costello, Roger L.; Markus Scherer
Subject: Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a 
Unicode character?

The RFC is just informative, not normative, and the effective usage and
implementations just support JSON as plain 16-bit streams, even if the
transport syntax requires encoding it as plain text (using some UTF, not
necessarily UTF-8, even if this is the default).
Try it yourself: you can perfectly well send JSON text containing '\uFFFF'
(noncharacter) or '\uD800' (unpaired surrogate), and I've not seen any JSON
implementation complain about one or the other. When receiving the JSON
stream and using it in JavaScript, you'll see no missing or replaced code
units, and no exception either.
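
This claim is easy to check against one implementation. The sketch below uses Python's standard `json` module only; it says nothing about how other parsers behave.

```python
import json

# A noncharacter escape and an unpaired high-surrogate escape, as JSON text.
nonchar = json.loads('"\\uFFFF"')
lone = json.loads('"\\uD800"')

# Both parse without an exception and come back as single code points.
print(len(nonchar), hex(ord(nonchar)))
print(len(lone), hex(ord(lone)))

# A well-formed surrogate pair is combined into one supplementary code point.
pair = json.loads('"\\uD83D\\uDE00"')
print(len(pair), hex(ord(pair)))

# The lone surrogate survives a round trip through the default encoder,
# which escapes all non-ASCII characters as \uXXXX.
roundtrip = json.dumps(lone)

# It cannot, however, be encoded as UTF-8 for transport.
try:
    lone.encode("utf-8")
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False
```

With this parser the escapes are accepted silently, which matches the observation above, but the resulting string is not encodable as well-formed UTF-8.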

2015-05-08 3:22 GMT+02:00 Daniel Bünzli <[email protected]>:
On Friday, 8 May 2015 at 02:16, Philippe Verdy wrote:
> It would be more exact to say that JSON strings, just like strings in 
> Javascript and Java or many programming languages are just binary streams of 
> 16-bit code units.

I suggest you give RFC 7159 a careful read, as it specifically implies that 
this is not the model it supports (albeit using broken or, let's say, 
ambiguous/imprecise Unicode terminology).

> Then the JSON processor will decode this text and will remap it to an 
> internal UTF-16 encoding (for characters that are not escaped) and the 
> "\uXXXX" will be decoded as plain 16-bit code units. The result will be a 
> stream of 16-bit code units, which can then externally be output and encoded 
> or stored in any convenient encoding that preserves this stream, EVEN if this 
> is not valid UTF-16.

I don't know where you get this from but you won't find any mention of this in 
the standard. We are dealing with text, Unicode scalar values, not encodings. 
At the risk of repeating myself, read section 8.2 of RFC 7159.
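
The distinction between code units and scalar values can be tested after parsing. The following is a minimal sketch in Python, where `find_suspect_code_points`, `is_surrogate` and `is_noncharacter` are hypothetical helpers (not part of any library). It reports surrogates, which are not Unicode scalar values, and noncharacters, which are scalar values but are never assigned to characters. It does not check whether a code point is merely unassigned.

```python
import json

def is_surrogate(cp):
    # Surrogate code points are excluded from the Unicode scalar values.
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp):
    # U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def find_suspect_code_points(text):
    """Return (index, code point, kind) for each surrogate or noncharacter."""
    found = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if is_surrogate(cp):
            found.append((i, cp, "surrogate"))
        elif is_noncharacter(cp):
            found.append((i, cp, "noncharacter"))
    return found

# Python's json module pairs valid surrogate escapes and leaves lone ones
# in the decoded string, so only the unpaired \uD800 is reported here.
decoded = json.loads('"ok \\uD800 \\uFFFF \\uD83D\\uDE00"')
report = find_suspect_code_points(decoded)
print(report)
```

A check like this has to run on the decoded string; the parser used here does not reject these escapes itself.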

Best,

Daniel
