Replace U+FFFE with U+FFFD in my message (although some applications prefer to use noncharacters for those replacements; that is an additional alternative, as U+FFFE also has a valid representation in UTF-16). U+FFFD is not the only possible replacement, even if it is recommended (by a "best practice", which is not a "requirement" for conformance purposes).
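To illustrate that the replacement is a policy choice rather than something fixed, here is a minimal Python sketch. The function name `sanitize` and its `replacement` parameter are hypothetical, and the input is assumed to be a plain list of 16-bit code units; this is only one possible way to sanitize, not a recommended one.

```python
def sanitize(units, replacement="\uFFFD"):
    """Decode UTF-16 code units, replacing each lone surrogate.

    `units` is a list of integers in 0..0xFFFF (an assumption of this sketch).
    `replacement` is whatever the application chose; nothing mandates U+FFFD.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        is_low = 0xDC00 <= u <= 0xDFFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Well-formed surrogate pair: one supplementary code point.
            low = units[i + 1]
            out.append(chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)))
            i += 2
        elif is_high or is_low:
            # Lone surrogate: the result depends entirely on the chosen policy.
            out.append(replacement)
            i += 1
        else:
            out.append(chr(u))
            i += 1
    return "".join(out)


units = [0x61, 0xD800, 0x62]  # 'a', lone high surrogate, 'b'
for policy in ("\uFFFD", "?", "\x1a", ""):
    result = sanitize(units, policy)
    print(repr(result), len(result))
```

The same input yields four different strings here (and two different lengths), which is the point: two implementations that both "sanitize" need not agree.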
2015-10-12 17:29 GMT+02:00 Philippe Verdy <[email protected]>:

> 2015-10-12 14:42 GMT+02:00 Mark Davis ☕️ <[email protected]>:
>
>> If these are not all aligned, then all heck breaks loose: you are letting
>> yourself in for code breakage and/or security problems.
>>
>> So the corresponding code point count would just return a count of 1 for
>> an isolated surrogate.
>>
>
> But the behavior in this case is absolutely not defined, and applications
> are free to do what they want when they encounter them. There is not even
> any guarantee that any further (correctly encoded) code point will be
> returned: even if a replacement character like U+FFFE is returned, it could
> replace all the rest.
>
> So a count of 1 is possible for the first isolated surrogate, but all the
> rest could count as 0 as well, or all the further characters could be
> replaced by U+FFFE independently of what they initially represented. This
> would also be a "sanitized" result.
>
> TUS gives applications freedom of choice. There is absolutely no guarantee
> that all possible "sanitized" results will be the same for all
> applications, and TUS does not even mandate which replacement character to
> use (not necessarily U+FFFE; it could as well be an ASCII '?' character or
> a C0 <SUB> or <DEL> control, when the result is further processed by an
> application converting it to some legacy 7-bit or 8-bit charset).
>
> My opinion is that the only really safe result is to not return any count
> of code points but instead throw an error. Counting code points with a
> function returning an integer is only valid if the UTF-16 input is actually
> a valid representation of code points. You cannot return a single integer,
> because the application using that integer could expect to allocate a
> processing buffer of that size and then get exactly that number of code
> points when reading the data into it; it could leave some positions in
> that buffer uninitialized, or it could assume that the input was left
> untouched and then get an unexpected digital signature mismatch.
>
> If your function counting code points and returning an integer counts those
> lone surrogates as 1, it assumes that exactly one code point will be
> returned for each lone surrogate, and it should document that clearly,
> meaning that the result is only valid if this matches the results of the
> actual input scanner. In that case the function will never fail and throw
> an exception. But between two implementations the result of the scanner
> could still be different, because the replacement character is not
> specified. If the resulting "sanitized" string is then used to generate a
> URI, the URI is also unpredictable and will vary between implementations,
> as will its effective length. If it is used to generate an identifier
> granting some new access, such as a user name, several new user names
> could be generated from the same input.
>
> So in all cases using replacements will also create security problems.
> This will not happen if you don't return any result but throw an exception
> (the counting function should document this exception so that it is not
> unexpectedly thrown and left unhandled, causing the program to abort
> prematurely in an unsafe state, including losing other data or leaving a
> transaction elsewhere in an incoherent state).
>
> For all programs taking some standard UTF input, the input scanner or
> processing functions MUST be prepared to handle the encoding error
> exception, which is a result to be expected just as much as the return of
> a value or the execution of some code! Sanitization is possible, but it is
> not described in the standard, and there are several conflicting ways of
> doing it; it should be a separate subprocess, documented separately.
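The two counting behaviors discussed in the quoted message can be sketched in Python as follows. This is only an illustration under stated assumptions: `EncodingError`, `count_code_points` and the `strict` flag are hypothetical names, and the input is assumed to be a list of 16-bit code units. With `strict=True` the function refuses to return a count for ill-formed input; with `strict=False` it counts each lone surrogate as 1, which is only meaningful if the scanner that later reads the data follows the same policy.

```python
class EncodingError(ValueError):
    """Raised for ill-formed UTF-16 input (hypothetical name for this sketch)."""


def count_code_points(units, strict=True):
    """Count code points in a list of UTF-16 code units.

    strict=True:  raise EncodingError on any lone surrogate.
    strict=False: count each lone surrogate as one code point
                  (one of several possible policies, not a mandated one).
    """
    count = 0
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            i += 2  # well-formed surrogate pair: one code point
        elif 0xD800 <= u <= 0xDFFF:
            if strict:
                raise EncodingError(f"lone surrogate 0x{u:04X} at index {i}")
            i += 1  # lenient policy: a lone surrogate counts as 1
        else:
            i += 1  # ordinary BMP code unit
        count += 1
    return count


well_formed = [0x41, 0xD83D, 0xDE00]  # 'A' followed by a surrogate pair
ill_formed = [0x41, 0xD800, 0x42]     # 'A', lone high surrogate, 'B'

print(count_code_points(well_formed))
print(count_code_points(ill_formed, strict=False))
try:
    count_code_points(ill_formed)
except EncodingError as exc:
    print("encoding error:", exc)
```

Callers of the strict variant have to handle `EncodingError` explicitly, which is the documented-exception requirement argued for in the quoted message.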

