RE: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2020-08-17 Thread Shawn Steele via Unicode
IMO, encodings, particularly ones depending on state such as this, may have 
multiple ways to output the same, or similar, sequences.  This means that 
pretty much any time an encoding transforms data, any previous security or other 
validation-style checks are no longer valid and must be performed again on the 
transformed data.  I've seen numerous mistakes due to people expecting 
encodings to play nicely, particularly if there are different endpoints that 
may use different implementations with slightly different behaviors.
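
To make the re-validation point concrete, here is a minimal Python sketch. It 
assumes Python's built-in "iso-2022-jp" codec as the decoder; the forbidden 
word and the two helper names (is_safe_bytes, is_safe_text) are hypothetical 
and only serve the illustration. The input spells "delete" with every other 
character in the JIS-Roman state, so a check on the raw bytes passes while the 
same check on the decoded text does not.

```python
FORBIDDEN = "delete"  # hypothetical word a filter wants to block


def is_safe_bytes(raw: bytes) -> bool:
    """Naive check performed on the encoded bytes, before decoding."""
    return FORBIDDEN.encode("ascii") not in raw


def is_safe_text(raw: bytes, encoding: str = "iso-2022-jp") -> bool:
    """The same check repeated on the text after the encoding transform."""
    return FORBIDDEN not in raw.decode(encoding)


# "delete" with alternating ASCII (ESC ( B) and JIS-Roman (ESC ( J) segments.
raw = b"d\x1b(Je\x1b(Bl\x1b(Je\x1b(Bt\x1b(Je\x1b(B"

print(is_safe_bytes(raw))  # True: the byte pattern b"delete" never occurs
print(is_safe_text(raw))   # False: the decoded text is "delete"
```

Only the second check reflects what a downstream consumer of the decoded text 
will actually see.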

-Shawn


Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2020-08-17 Thread Harriet Riddle via Unicode
In terms of deployed ISO-2022-JP encoders which don't follow WHATWG behaviour, 
here's Python's (apparently contributed to Python by one Hye-Shik Chang):

>>> "a¥bc~¥d".encode("iso-2022-jp")
b'a\x1b(J\\\x1b(Bbc~\x1b(J\\\x1b(Bd'

This is, so far as I can tell, valid per the RFC (and of course ECMA-35 itself), 
but not per the WHATWG, whose output would be (to use another bytestring 
literal) b'a\x1b(J\\bc\x1b(B~\x1b(J\\d\x1b(B'. The difference is that 
Python's encoder appears to use a preference order of codesets, with ASCII 
before JIS-Roman, while the WHATWG logic is to encode the next character 
in the current codeset if possible, and to switch to another only if it is not.
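
As a quick check that the two byte strings above are different spellings of 
the same text, the following sketch decodes both with Python's built-in 
"iso-2022-jp" codec. The variable names are made up for the illustration, and 
the second byte string is the WHATWG-style output quoted above, not something 
produced by running a WHATWG encoder here.

```python
# Output of Python's encoder, as shown in the session above.
python_style = b"a\x1b(J\\\x1b(Bbc~\x1b(J\\\x1b(Bd"
# WHATWG-style output as quoted above: stay in JIS-Roman while possible.
whatwg_style = b"a\x1b(J\\bc\x1b(B~\x1b(J\\d\x1b(B"

text_a = python_style.decode("iso-2022-jp")
text_b = whatwg_style.decode("iso-2022-jp")

# Both byte strings decode to the same seven characters.
print(text_a == text_b)
```

The "~" forces a return to ASCII in both cases because byte 0x7E means 
OVERLINE, not TILDE, in the JIS-Roman state.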

-- Har



Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2020-08-17 Thread Henri Sivonen via Unicode
Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️  wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no 
> content between escapes in "shifting" encodings like ISO-2022-JP is 
> unnecessary, and for consistency between implementations should not be 
> recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the 
> committee can look at your proposal with an eye to changing 
> http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode 
>  wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there
>> is no content between two ISO-2022-JP escape sequences from the WHATWG
>> Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in
>> that case is not a useful security measure when unnecessary
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen  wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal
>> > input byte sequence. Instead, it must stop with an error or substitute
>> > a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
>> > or an escape sequence in the output. (See also Section 3.5 Deletion of
>> > Code Points.) It is important to do this not only for byte sequences
>> > that encode characters, but also for unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next shift
>> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
>> > require at least one character in a text segment between shift
>> > sequences. Security software written to the formal specification may
>> > not detect malicious text  (for example, "delete" with a
>> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement by means of the
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its
>> > ISO-2022-JP decoder algorithm
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
>> > WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that didn't
>> > implement this U+FFFD generation behavior (uconv), a bug has been
>> > logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP
>> > the unusual and surprising property that concatenating two ISO-2022-JP
>> > outputs from a conforming encoder can result in a byte sequence that
>> > is non-conforming as input to an ISO-2022-JP decoder.
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
>> > sequence is immediately followed by another ISO-2022-JP escape
>> > sequence. Chrome and Safari do, but their implementations of
>> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's
>> > decoder implementations generally are informed by the Encoding
>> > Standard (though the ISO-2022-JP decoder specifically might not be
>> > yet), and I suspect that Safari's implementation (ICU) is either
>> > informed by Unicode Security Considerations or vice versa.
>> >
>> > The example given as rationale in Unicode Security Considerations,
>> > obfuscating the ASCII string "delete", could be accomplished by
>> > alternating between the ASCII and Roman states so that every other
>> > character is in the ASCII state and the rest in the Roman state.
>> >
>> > Is the requirement to generate U+FFFD when there is no content between
>> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
>> > transitions or useless transitions between ASCII and Roman are not
>> > also required to generate U+FFFD? Would it even be feasible (in terms
>> > of interop with legacy encoders) to make useless transitions between
>> > ASCII and Roman generate U+FFFD?
>> >
>> > --
>> > Henri Sivonen
>> > hsivo...@hsivonen.fi
>> > https://hsivonen.fi/
>>
>>
>>
>> --
>> Henri Sivonen
>> hsivo...@hsivonen.fi
>> https://hsivonen.fi/
>>


-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/



Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

2018-12-10 Thread Henri Sivonen via Unicode
We're about to remove the U+FFFD generation for the case where there
is no content between two ISO-2022-JP escape sequences from the WHATWG
Encoding Standard.

Is there anything wrong with my analysis that U+FFFD generation in
that case is not a useful security measure when unnecessary
transitions between the ASCII and Roman states do not generate U+FFFD?
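
The analysis can be tried out with a small Python sketch. This is not the 
WHATWG algorithm or the encoding_rs implementation: it is a hypothetical 
decoder that models only the ASCII and Roman states, and the names decode, 
flag_empty_segments and escape_pending are invented for the illustration. It 
shows that the empty-segment check flags the concatenation of two well-formed 
encoder outputs, while "delete" obfuscated by alternating between ASCII and 
Roman passes through without any U+FFFD.

```python
ESC = 0x1B
# Only the two single-byte states are modelled; real ISO-2022-JP has more.
ESCAPES = {b"(B": "ascii", b"(J": "roman"}
REPLACEMENT = "\uFFFD"


def decode(data: bytes, flag_empty_segments: bool) -> str:
    """Decode an ASCII/Roman-only subset of ISO-2022-JP.

    If flag_empty_segments is true, an escape sequence that directly
    follows another escape sequence produces U+FFFD.
    """
    state = "ascii"
    escape_pending = False  # escape seen, no character output since
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            new_state = ESCAPES.get(data[i + 1:i + 3])
            if new_state is None:
                # Unrecognized or truncated escape: always an error here.
                out.append(REPLACEMENT)
                i += 1
                continue
            if escape_pending and flag_empty_segments:
                out.append(REPLACEMENT)
            state = new_state
            escape_pending = True
            i += 3
            continue
        if byte > 0x7F:
            out.append(REPLACEMENT)
        elif state == "roman" and byte == 0x5C:
            out.append("\u00A5")  # YEN SIGN
        elif state == "roman" and byte == 0x7E:
            out.append("\u203E")  # OVERLINE
        else:
            out.append(chr(byte))
        escape_pending = False
        i += 1
    return "".join(out)


# A well-formed message: switch to Roman, one yen sign, back to ASCII.
yen = b"\x1b(J\\\x1b(B"
# "delete" with every other character in the Roman state.
alternating = b"d\x1b(Je\x1b(Bl\x1b(Je\x1b(Bt\x1b(Je\x1b(B"

print(repr(decode(yen + yen, flag_empty_segments=True)))   # '¥\ufffd¥'
print(repr(decode(yen + yen, flag_empty_segments=False)))  # '¥¥'
print(decode(alternating, flag_empty_segments=True))       # delete
```

In the concatenated input, the ESC ( B that ends the first message is 
immediately followed by the ESC ( J that starts the second, which is exactly 
the empty segment the check penalizes.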

On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen  wrote:
>
> Context: https://github.com/whatwg/encoding/issues/115
>
> Unicode Security Considerations say:
> "3.6.2 Some Output For All Input
>
> Character encoding conversion must also not simply skip an illegal
> input byte sequence. Instead, it must stop with an error or substitute
> a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
> or an escape sequence in the output. (See also Section 3.5 Deletion of
> Code Points.) It is important to do this not only for byte sequences
> that encode characters, but also for unrecognized or "empty"
> state-change sequences. For example:
> [...]
> ISO-2022 shift sequences without text characters before the next shift
> sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
> require at least one character in a text segment between shift
> sequences. Security software written to the formal specification may
> not detect malicious text  (for example, "delete" with a
> shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
> (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>
> The WHATWG Encoding Standard bakes this requirement by means of the
> "ISO-2022-JP output flag"
> (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its
> ISO-2022-JP decoder algorithm
> (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>
> encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
> WHATWG spec.
>
> After Gecko switched to encoding_rs from an implementation that didn't
> implement this U+FFFD generation behavior (uconv), a bug has been
> logged in the context of decoding Japanese email in Thunderbird:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>
> Ken Lunde also recalls seeing such email:
> https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>
> The root problem seems to be that the requirement gives ISO-2022-JP
> the unusual and surprising property that concatenating two ISO-2022-JP
> outputs from a conforming encoder can result in a byte sequence that
> is non-conforming as input to an ISO-2022-JP decoder.
>
> Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
> sequence is immediately followed by another ISO-2022-JP escape
> sequence. Chrome and Safari do, but their implementations of
> ISO-2022-JP aren't independent of each other. Moreover, Chrome's
> decoder implementations generally are informed by the Encoding
> Standard (though the ISO-2022-JP decoder specifically might not be
> yet), and I suspect that Safari's implementation (ICU) is either
> informed by Unicode Security Considerations or vice versa.
>
> The example given as rationale in Unicode Security Considerations,
> obfuscating the ASCII string "delete", could be accomplished by
> alternating between the ASCII and Roman states so that every other
> character is in the ASCII state and the rest in the Roman state.
>
> Is the requirement to generate U+FFFD when there is no content between
> ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
> transitions or useless transitions between ASCII and Roman are not
> also required to generate U+FFFD? Would it even be feasible (in terms
> of interop with legacy encoders) to make useless transitions between
> ASCII and Roman generate U+FFFD?
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/



-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/