We're about to remove from the WHATWG Encoding Standard the U+FFFD generation for the case where there is no content between two ISO-2022-JP escape sequences.
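
To make the case above, and the ASCII/Roman alternation raised below, concrete, here is a minimal Python sketch. It is not the Encoding Standard's algorithm and not encoding_rs: decode_sketch, alternate, and the flag_empty_segments switch are names made up for illustration, only the ASCII and Roman states are handled, and the byte strings are hand-written.

```python
ESC_ASCII = b"\x1b(B"  # escape sequence selecting the ASCII state
ESC_ROMAN = b"\x1b(J"  # escape sequence selecting the Roman state


def decode_sketch(data: bytes, flag_empty_segments: bool) -> str:
    """Toy decoder (hypothetical) covering only the ASCII and Roman states.

    If flag_empty_segments is true, an escape sequence that directly
    follows another escape sequence yields U+FFFD.
    """
    out = []
    roman = False
    escape_pending = False  # True after an escape until a character is output
    i = 0
    while i < len(data):
        chunk = data[i:i + 3]
        if chunk in (ESC_ASCII, ESC_ROMAN):
            if escape_pending and flag_empty_segments:
                out.append("\ufffd")
            roman = chunk == ESC_ROMAN
            escape_pending = True
            i += 3
            continue
        byte = data[i]
        if byte == 0x1B or byte >= 0x80:
            out.append("\ufffd")  # anything else is unsupported in this sketch
        elif roman and byte == 0x5C:
            out.append("\u00a5")  # Roman maps 0x5C to YEN SIGN
        elif roman and byte == 0x7E:
            out.append("\u203e")  # Roman maps 0x7E to OVERLINE
        else:
            out.append(chr(byte))
        escape_pending = False
        i += 1
    return "".join(out)


def alternate(text: str) -> bytes:
    """Put every other character in the ASCII state and the rest in Roman."""
    out = bytearray()
    for n, ch in enumerate(text):
        out += ESC_ROMAN if n % 2 else ESC_ASCII
        out += ch.encode("ascii")
    return bytes(out)


# An empty segment in the middle of "delete": flagged only if the check is on.
empty = b"de" + ESC_ROMAN + ESC_ASCII + b"lete"
flagged = decode_sketch(empty, True)      # "de\ufffdlete"
unflagged = decode_sketch(empty, False)   # "delete"

# Alternating states: the raw bytes no longer contain b"delete", yet no
# escape sequence directly follows another, so nothing is flagged.
obfuscated = alternate("delete")
still_delete = decode_sketch(obfuscated, True)  # "delete"

# Two outputs of the shape "switch to Roman, 0x5C, return to ASCII",
# concatenated: the join point is an empty segment.
one = ESC_ROMAN + b"\x5c" + ESC_ASCII
joined = decode_sketch(one + one, True)   # "\u00a5\ufffd\u00a5"
```

In this sketch the empty-segment check catches the first input but not the alternating one, while it does flag the concatenation of two inputs that are each fine on their own.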
Is there anything wrong with my analysis that U+FFFD generation in that case is not a useful security measure when unnecessary transitions between the ASCII and Roman states do not generate U+FFFD?

On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen <hsivo...@hsivonen.fi> wrote:
>
> Context: https://github.com/whatwg/encoding/issues/115
>
> Unicode Security Considerations say:
> "3.6.2 Some Output For All Input
>
> Character encoding conversion must also not simply skip an illegal
> input byte sequence. Instead, it must stop with an error or substitute
> a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
> or an escape sequence in the output. (See also Section 3.5 Deletion of
> Code Points.) It is important to do this not only for byte sequences
> that encode characters, but also for unrecognized or "empty"
> state-change sequences. For example:
> [...]
> ISO-2022 shift sequences without text characters before the next shift
> sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
> require at least one character in a text segment between shift
> sequences. Security software written to the formal specification may
> not detect malicious text (for example, "delete" with a
> shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
> (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>
> The WHATWG Encoding Standard bakes this requirement in by means of
> the "ISO-2022-JP output flag"
> (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) in its
> ISO-2022-JP decoder algorithm
> (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>
> encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
> WHATWG spec.
>
> After Gecko switched to encoding_rs from an implementation that didn't
> implement this U+FFFD generation behavior (uconv), a bug has been
> logged in the context of decoding Japanese email in Thunderbird:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>
> Ken Lunde also recalls seeing such email:
> https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>
> The root problem seems to be that the requirement gives ISO-2022-JP
> the unusual and surprising property that concatenating two ISO-2022-JP
> outputs from a conforming encoder can result in a byte sequence that
> is non-conforming as input to an ISO-2022-JP decoder.
>
> Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
> sequence is immediately followed by another ISO-2022-JP escape
> sequence. Chrome and Safari do, but their implementations of
> ISO-2022-JP aren't independent of each other. Moreover, Chrome's
> decoder implementations generally are informed by the Encoding
> Standard (though the ISO-2022-JP decoder specifically might not be
> yet), and I suspect that Safari's implementation (ICU) is either
> informed by Unicode Security Considerations or vice versa.
>
> The example given as rationale in Unicode Security Considerations,
> obfuscating the ASCII string "delete", could be accomplished by
> alternating between the ASCII and Roman states so that every other
> character is in the ASCII state and the rest in the Roman state.
>
> Is the requirement to generate U+FFFD when there is no content between
> ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
> transitions or useless transitions between ASCII and Roman are not
> also required to generate U+FFFD? Would it even be feasible (in terms
> of interop with legacy encoders) to make useless transitions between
> ASCII and Roman generate U+FFFD?
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/