2015-10-12 14:42 GMT+02:00 Mark Davis ☕️ <[email protected]>:

> If these are not all aligned, then all heck breaks loose: you are letting
> yourself in for code breakage and/or security problems.
>
> So the corresponding code point count would just return a count of 1 for
> an isolated surrogate.
>

But the behavior in this case is absolutely not defined, and applications are free to do what they want when they encounter isolated surrogates. There is not even any guarantee that any further (correctly encoded) code point will be returned: even if a replacement character like U+FFFD is returned, it could replace all the rest of the input. So a count of 1 is possible for the first isolated surrogate while all the rest counts as 0, or all the further characters could be replaced by U+FFFD independently of what they initially represented. This would also be a "sanitized" result.

TUS gives applications freedom of choice. There is absolutely no guarantee that all possible "sanitized" results will be the same for all applications, and TUS does not even mandate which replacement character to use (not necessarily U+FFFD: it could as well be an ASCII '?' character or a C0 <SUB> or <DEL> control, when the result is further converted by an application to some legacy 7-bit or 8-bit charset).

My opinion is that the only really safe result is to not return any count of code points but instead to throw an error. Counting code points with a function returning an integer is only valid if the UTF-16 input is actually a valid representation of code points. You cannot return a single integer, because the application using that integer could allocate a processing buffer of that size and expect to get exactly this number of code points when reading the data into it, and could then leave some positions in that buffer uninitialized; or the application could assume that the input was left untouched and could then get an unexpected digital signature mismatch.

If your function counting code points and returning an integer counts those lone surrogates as 1, it assumes that exactly one code point will be returned for each lone surrogate, and it should document that clearly, meaning that the result is only valid if this matches the results of the actual input scanner.

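To make the two counting behaviours discussed above concrete, here is a minimal sketch in Python over a list of 16-bit code units. The names (LoneSurrogateError, count_code_points_strict, count_code_points_lenient) are hypothetical and only illustrative; they are not taken from any library or from the standard.

```python
class LoneSurrogateError(ValueError):
    """Raised when UTF-16 input contains an unpaired surrogate code unit."""


def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def count_code_points_strict(units):
    """Count code points in a sequence of UTF-16 code units.

    Raises LoneSurrogateError instead of returning a number when the
    input is not well-formed UTF-16.
    """
    count = 0
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if is_high(u):
            if i + 1 < n and is_low(units[i + 1]):
                i += 2  # well-formed surrogate pair: one code point
            else:
                raise LoneSurrogateError(f"unpaired high surrogate at index {i}")
        elif is_low(u):
            raise LoneSurrogateError(f"unpaired low surrogate at index {i}")
        else:
            i += 1  # BMP code unit: one code point
        count += 1
    return count


def count_code_points_lenient(units):
    """Count code points, counting every lone surrogate as exactly 1.

    Assumption: the scanner that later reads the data must also yield
    exactly one code point per lone surrogate, or the count is wrong.
    """
    count = 0
    i = 0
    n = len(units)
    while i < n:
        if is_high(units[i]) and i + 1 < n and is_low(units[i + 1]):
            i += 2
        else:
            i += 1
        count += 1
    return count
```

For the input [0x0041, 0xD800, 0x0042], the lenient variant returns 3, while the strict variant raises LoneSurrogateError, which the caller has to be prepared to catch.
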
A function that counts each lone surrogate as 1 will never fail or throw an exception. But between two implementations the result of the scanner could still be different because the replacement character is not specified. If that "sanitized" result string is then used to generate a URI, the URI is also unpredictable and will vary between implementations, as will its effective length. If it is used to generate an identifier granting some new access, such as a user name, several new user names could be generated from the same input. So in all cases using replacements will also create security problems.

This will not happen if you don't return any result but throw an exception. That counting function should document this exception so that it is not unexpectedly thrown and left unhandled, causing the program to abort prematurely in an unsafe state, including losing other data or leaving a transaction elsewhere in an incoherent state.

For all programs taking some standard UTF input, the input scanner or processing functions MUST be prepared to handle the encoding error exception, which is a result to be expected just as much as the return of a value or the execution of some code!

Sanitization is possible, but it is not described in the standard, and there are several conflicting ways of doing it; it should be a separate subprocess, documented separately.
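As an illustration of how sanitization policies can diverge, the following Python sketch applies three plausible policies to the same ill-formed input. The sanitize function, its parameters, and the sample "user name" are all hypothetical, chosen only for this example.

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def sanitize(units, replacement, stop_at_first_error=False):
    """Decode UTF-16 code units, substituting `replacement` for lone surrogates.

    If stop_at_first_error is true, everything after the first lone
    surrogate is dropped (one of the outcomes an implementation may choose).
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if is_high(u) and i + 1 < n and is_low(units[i + 1]):
            # Well-formed pair: combine into a supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(cp))
            i += 2
        elif is_high(u) or is_low(u):
            out.append(replacement)
            if stop_at_first_error:
                break
            i += 1
        else:
            out.append(chr(u))
            i += 1
    return "".join(out)


# "bob", a lone high surrogate, then "1"
raw = [0x0062, 0x006F, 0x0062, 0xD800, 0x0031]

name_a = sanitize(raw, "\uFFFD")                            # U+FFFD per error
name_b = sanitize(raw, "?")                                 # ASCII '?' per error
name_c = sanitize(raw, "\uFFFD", stop_at_first_error=True)  # rest discarded

# Three different "sanitized" identifiers from one input.
distinct_names = {name_a, name_b, name_c}
```

Here distinct_names contains three different strings with two different lengths, so any user name, URI, or signature derived from the "sanitized" text depends on which policy the implementation happened to choose.
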

