Any system that exposes Unicode strings (as opposed to UTF-16 strings) cannot let two surrogates merge into a single character when two strings are appended. Nothing in the Unicode Standard says that should happen for a string in an arbitrary format, and it is unreasonable behavior for a string. Thus a Unicode string simply can't be stored internally as UTF-16 with unpaired surrogates; a Unicode string in a programmer-opaque format must do something with broken data on input.
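
To make the distinction concrete, here is a minimal Python sketch. It assumes Python 3's `str` as the example of a code point string and uses the standard `surrogatepass` error handler to model a UTF-16 code unit string. The names `HIGH`, `LOW`, `utf16_units` and `strict_input` are made up for illustration; they are not from any particular implementation.

```python
# Two halves of U+1F600, each held as a lone surrogate code point.
HIGH = "\ud83d"  # unpaired high (lead) surrogate
LOW = "\ude00"   # unpaired low (trail) surrogate


def utf16_units(s: str) -> bytes:
    """Hypothetical helper: serialize to UTF-16LE code units,
    letting lone surrogates through unchanged."""
    return s.encode("utf-16-le", "surrogatepass")


# Code point string: appending keeps two separate (broken) code points.
joined = HIGH + LOW

# UTF-16 code unit string: appending the raw units produces a valid
# surrogate pair, so the two halves merge into one character on decode.
merged = (utf16_units(HIGH) + utf16_units(LOW)).decode("utf-16-le")


def strict_input(units: bytes) -> str:
    """One possible input policy: reject broken data outright.
    Raises UnicodeDecodeError on an unpaired surrogate."""
    return units.decode("utf-16-le")
```

In this sketch `joined` has length 2 and is not equal to `"\U0001F600"`, while `merged` is exactly that one character. Rejecting the data in `strict_input` is only one of the things an opaque string type could do with broken input; replacing it is another, with the caveats discussed below.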

On 1:27pm, Mon, Oct 12, 2015 Richard Wordingham < [email protected]> wrote:
> On Mon, 12 Oct 2015 17:29:13 +0200
> Philippe Verdy <[email protected]> wrote:
>
> > But between two implementations
> > the result of the scanner could still be different because the
> > replacement character is not specified. If that result "sanitized"
> > string is then used to generate an URI, the URI is also unpredictable
> > and will vary between implementations, as well as its effective
> > length. If it is used to generate an identifier granting some new
> > access, such as a user name, several new user names could be
> > generated from the same input.
>
> TUS 8.0 Section 3 Requirement C10 has the following, wise words in its
> final paragraph:
>
> "However, such repair of mangled data is a special case, and it must
> not be used in circumstances where it would cause security problems."
>
> Richard.
>

