Re: Counting Codepoints

David Starner Mon, 12 Oct 2015 16:38:28 -0700

Any system that exposes Unicode strings (not UTF-16 strings) cannot have
two surrogates merge into one code point when two strings are appended.
Nothing in the Unicode Standard says that should happen for a string in an
arbitrary format, and it would be unreasonable behavior for a string. Thus a
Unicode string simply can't be stored internally as UTF-16 with unpaired
surrogates; a Unicode string in a programmer-opaque format must do
something with broken data on input.
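
To make the merging concrete, here is a rough sketch in Python, chosen only
because its str type is a sequence of code points rather than UTF-16 code
units. The names append_via_utf16 and sanitize are hypothetical helpers for
illustration, and replacing with U+FFFD is just one possible policy for
broken input, not something the standard mandates.

```python
HIGH = "\ud83d"  # unpaired high surrogate
LOW = "\ude00"   # unpaired low surrogate


def append_via_utf16(a: str, b: str) -> str:
    """Append two strings by concatenating their UTF-16 code units,
    roughly as a UTF-16-backed string type would (illustrative only)."""
    units = (a.encode("utf-16-le", "surrogatepass")
             + b.encode("utf-16-le", "surrogatepass"))
    return units.decode("utf-16-le", "surrogatepass")


def sanitize(s: str) -> str:
    """Replace every surrogate code point with U+FFFD on input.
    This is an assumed policy; other implementations may choose differently."""
    return "".join("\ufffd" if 0xD800 <= ord(c) <= 0xDFFF else c for c in s)


# Appending code-point strings keeps the two surrogates separate.
assert len(HIGH + LOW) == 2

# Appending at the UTF-16 code-unit level merges them into U+1F600.
assert append_via_utf16(HIGH, LOW) == "\U0001F600"

# Sanitizing on input means no surrogate is left to merge later.
assert sanitize(HIGH) + sanitize(LOW) == "\ufffd\ufffd"
```

The first two assertions show the difference at issue: the same append gives
two code points in one representation and one code point in the other.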


On 1:27pm, Mon, Oct 12, 2015 Richard Wordingham <
[email protected]> wrote:

> On Mon, 12 Oct 2015 17:29:13 +0200
> Philippe Verdy <[email protected]> wrote:
>
> > But between two implementations
> > the result of the scanner could still be different because the
> > replacement character is not specified. If that result "sanitized"
> > string is then used to generate an URI, the URI is also unpredictable
> > and will vary between implementations, as well as its effective
> > length. If it is used to generate an identifier granting some new
> > access, such as a user name, several new user names could be
> > generated from the same input.
>
> TUS 8.0 Section 3 Requirement C10 has the following, wise words in its
> final paragraph:
>
> "However, such repair of mangled data is a special case, and it must
> not be used in circumstances where it would cause security problems."
>
> Richard.
>
