I just wonder whether an XSS attack is really an issue here. XSS attacks involve bypassing the document's source domain in order to use or insert data found in another document issued or managed by another domain, in a distinct security realm.
A more serious issue is the fact that the parsed document has unknown security properties, and that it is subject to inspection (for example by an antivirus or antimalware tool trying to identify sensitive code that would remain usable, but hidden by a cipher-like invalid encoding that a browser would just interpret blindly).

One problem with the strategy of blindly deleting invalid sequences is of course the fact that such invalid sequences may be complex and could be arbitrarily long. But antivirus/antimalware solutions already know how to ignore these invalid sequences when trying to identify malicious code, so that they detect more possibilities. In that case, the safest strategy for an antivirus is effectively to discard the invalid sequences, trying to mimic what an unaware browser would do blindly, with the consequence of running the potentially dangerous code. The strategy used in a browser for rendering the document, and the one used in a security solution for detecting malicious code, will then be completely opposed.

Another concern is the choice of the replacement character. This document only suggests the U+FFFD character, which may also not pass through some encoding converters used when forwarding the document to a lower-layer API that effectively runs the code. If code (as opposed to normal text) is involved, it will frequently be restricted to ASCII or to an SBCS encoding. In that case, a better substitute is an ASCII C0 control that is normally invalid in plain-text programming/scripting source code. Traditionally this C0 control character is SUB. It may even be used to replace all the bytes of an invalid UTF-8 sequence without changing its length (this is not always possible with U+FFFD in UTF-8, because it is encoded as 3 bytes, and there may be invalid/rejected sequences containing only 1 or 2 bytes that should survive with the same length after the replacement).
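To make the length-preserving idea concrete, here is a minimal sketch (my own illustration, not anything from the document under discussion) that replaces every byte belonging to an invalid UTF-8 sequence with ASCII SUB (0x1A), so the output always has exactly the same byte length as the input:

```python
def sanitize_utf8_same_length(data: bytes, sub: int = 0x1A) -> bytes:
    """Replace each byte of any invalid UTF-8 sequence with SUB (0x1A),
    preserving the total byte length of the input."""
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII byte, always valid
            out.append(b)
            i += 1
            continue
        # Expected number of continuation bytes, from the lead byte.
        if 0xC2 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF4:
            need = 3
        else:                             # invalid lead (0x80-0xC1, 0xF5-0xFF)
            out.append(sub)
            i += 1
            continue
        seq = data[i:i + need + 1]
        try:
            # Full check: rejects truncation, overlongs, surrogates, > U+10FFFF.
            seq.decode('utf-8')
            out.extend(seq)
            i += need + 1
        except UnicodeDecodeError:
            out.append(sub)               # replace the lead byte, then rescan
            i += 1
    return bytes(out)
```

For example, the truncated sequence b'\xE2\x82' becomes b'\x1A\x1A': two invalid bytes in, two SUB bytes out, so offsets into the data are unchanged.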
One concern is that SUB and U+FFFD have different character properties, and not all Unicode algorithms treat SUB the way they should (for example in boundary breakers or in some transforms). Another concern is that even this C0 control may be used for controlling some terminal functions (such uses probably survive only in very old applications), so some code converters use the question mark (?) instead, which is even worse: it may break a query URL, unexpectedly passing the data encoded after it to a different HTTP(S) resource than the expected one, and it will also bypass some cache-control mechanisms.

The document does not really discuss how to choose the replacement character. My opinion is that for UTF-8 encoded documents, the ASCII C0 control SUB is still better than the U+FFFD character, which works well only in the UTF-16 and UTF-32 encodings. SUB also works well with many legacy SBCS or MBCS encodings (including ISO 8859-*, Windows codepages and many PC/OEM codepages, JIS or EUC variants; it is also mapped in many EBCDIC codepages, distinctly from the simple filler/padding characters that are blindly stripped in many applications as if they were just whitespace at the end of a fixed-width data field). How many replacements must be made? My opinion is that replacements should be made so that no change occurs to the data length.
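The byte-length argument above can be checked directly; this small illustration (mine, not from the document) shows why U+FFFD cannot always preserve byte length in UTF-8, while a per-byte SUB substitution always can:

```python
# A truncated 3-byte sequence occupies only 2 bytes.
bad = b'\xE2\x82'

# One U+FFFD for the whole sequence: 3 bytes, so the data grows by 1.
one_fffd = '\uFFFD'.encode('utf-8')          # b'\xef\xbf\xbd'

# One U+FFFD per invalid byte: 6 bytes, so the data grows by 4.
per_byte_fffd = one_fffd * len(bad)

# One SUB (0x1A) per invalid byte: still exactly 2 bytes.
per_byte_sub = b'\x1a' * len(bad)

print(len(bad), len(one_fffd), len(per_byte_fffd), len(per_byte_sub))
```

Running it prints `2 3 6 2`: only the SUB substitution leaves every offset in the surrounding data unchanged.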
For the remaining cases, data security can detect corruption with strong signatures: SHA-1 for documents that are not too long (like HTML pages, or full email contents together with the common headers needed for their indexing, routing or delivery to the right person), or SHA-256 for very short documents (like single datagrams, or the values of short database fields such as phone numbers, last names or email addresses) and for very long documents. Security certificates over a secure channel will also detect otherwise undetected data corruption in the end-to-end communication channel, either one-to-one or one-to-many for broadcasts and selective multicasts. But the case of secure channels should not be a problem here, as such channels already have to detect and secure many other cases than just invalid plain-text encodings, notably man-in-the-middle attacks and replay attacks, or to reliably detect a DoS attack via a broken channel with unrecoverable data losses, something that can be enforced by reasonable timeout watchdogs if the performance of the channel must be ensured.

2012/7/27 Mark Davis ☕ <m...@macchiato.com>:
> Thanks, good suggestion.
>
> Mark
>
> — Il meglio è l’inimico del bene —
>
>
>
> On Thu, Jul 26, 2012 at 12:40 PM, CE Whitehead <cewcat...@hotmail.com>
> wrote:
>>
>> "Validation;" par 3, comment in parentheses
>> ". . . (you never want to just delete it; that has security problems)."
>> { COMMENT: would it be helpful here to have a reference here to the
>> unicode security document that discusses this issue -- TR 36, 3.5
>> http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters ?}
>
>