While this may not change the OP's need for such a tool, I read the JSON specification as allowing all code points 0x0000-0xFFFF in \uXXXX escapes, regardless of whether they map to "valid" Unicode characters. The allowed use of escaped UTF-16 surrogate pairs for characters with code points above 0xFFFF (without also specifying that unpaired surrogates are invalid) is troubling on the margin, and complicates such a validation.
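To make the surrogate issue concrete: a validator that has parsed a run of \uXXXX escapes into 16-bit values would need a check along these lines to reject lone surrogates (a hypothetical Java sketch; the method name and the int[] representation are mine):

    // Hypothetical helper: given the 16-bit values parsed from a run of
    // \uXXXX escapes, reject any unpaired surrogate.
    static boolean surrogatesArePaired(int[] units) {
        for (int i = 0; i < units.length; i++) {
            if (units[i] >= 0xD800 && units[i] <= 0xDBFF) {        // high surrogate
                if (i + 1 >= units.length
                        || units[i + 1] < 0xDC00 || units[i + 1] > 0xDFFF) {
                    return false;                                  // high half without a low half
                }
                i++;                                               // skip the paired low half
            } else if (units[i] >= 0xDC00 && units[i] <= 0xDFFF) {
                return false;                                      // stray low half
            }
        }
        return true;
    }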
Another complication is that a "JSON document" might itself be non-ASCII (UTF-8, -16, or -32) and contain Unicode characters as literals within quoted strings... Not to mention the ambiguous case of a surrogate pair where one half is literal and the other half escaped...

> On May 7, 2015, at 2:33 PM, Mark Davis ☕️ <[email protected]> wrote:
>
> The simplest approach would be to use ICU in a little program that scans
> the file. For example, you could write a little Java program that would
> scan the file and turn any sequence of (\uXXXX)+ into a String, then test
> that string with:
>
> static final UnicodeSet OK = new
>     UnicodeSet("[^[:unassigned:][:surrogate:]]").freeze();
> ...
> // inside the scanning function
> boolean isOk = OK.containsAll(slashUString);
>
> It is key that it has to grab the entire sequence of \uXXXX in a row;
> otherwise it will get the wrong answer.
>
> Mark <https://google.com/+MarkDavis>
>
> — The best is the enemy of the good —
>
> On Thu, May 7, 2015 at 10:49 AM, Doug Ewell <[email protected]> wrote:
>
> "Costello, Roger L." <Costello at mitre dot org> wrote:
>
> > Are there tools to scan a JSON document to detect the presence of
> > \uXXXX, where XXXX does not correspond to any Unicode character?
>
> A tool like this would need to scan the Unicode Character Database, for
> some given version, to determine which code points have been allocated
> to a coded character in that version and which have not.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
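For what it's worth, a self-contained version of the approach Mark sketches above might look like the following. This assumes ICU4J is on the classpath; the class name, the regex, and the I/O are my own, and the naive regex neither excludes \uXXXX runs preceded by an escaped backslash nor restricts itself to escapes inside string literals:

    import com.ibm.icu.text.UnicodeSet;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ScanEscapes {
        // Everything except unassigned code points and surrogates.
        static final UnicodeSet OK =
                new UnicodeSet("[^[:unassigned:][:surrogate:]]").freeze();

        // Grab whole runs of \uXXXX so surrogate pairs stay together.
        static final Pattern RUN = Pattern.compile("(?:\\\\u[0-9A-Fa-f]{4})+");

        public static void main(String[] args) throws Exception {
            String text = new String(Files.readAllBytes(Paths.get(args[0])),
                    StandardCharsets.UTF_8);
            Matcher m = RUN.matcher(text);
            while (m.find()) {
                String run = m.group();
                StringBuilder decoded = new StringBuilder();
                for (int i = 0; i < run.length(); i += 6) {   // each escape is 6 chars
                    decoded.append((char) Integer.parseInt(
                            run.substring(i + 2, i + 6), 16));
                }
                // containsAll() iterates by code point, so it flags unassigned
                // code points and unpaired surrogates alike.
                if (!OK.containsAll(decoded.toString())) {
                    System.out.println("suspect escape run at offset "
                            + m.start() + ": " + run);
                }
            }
        }
    }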

