On Thu, Sep 21, 2023 at 11:26:11AM +0200, Omar Polo wrote:
> Do we really need the two checks?

FWIW, my original suggestion, made off-list, was about checking for 0xfe and
0xff only:

Crystal wrote:
> 0xfe and 0xff are invalid in utf-8.
> 
> It _might_ be worth detecting them and in this case not outputting any mime
> headers at all, since the data would be neither us-ascii nor valid utf-8, and
> therefore possibly some other encoding, (that the user is aware of and
> handling correctly themselves).
> 
> OTOH, if we're not doing a complete check for valid utf-8, maybe such a
> partial check is worse than no check at all.

I _didn't_ advocate putting a whole utf-8 parser in.

The rationale is that seeing 0xfe or 0xff immediately makes it an invalid
utf-8 stream. In that case it is much more likely to be a different 8-bit
encoding, but we don't know for sure, so it is best to do no further
processing of headers.

Also, 0xff can easily turn up in input piped from other broken or exploited
code, so maybe in that case we also don't want to do further processing.

The single loop checking for ascii characters could easily check 0xfe and
0xff with a trivial change.
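
Something along these lines is what I have in mind. This is only a rough
sketch, not a diff against the tree: classify() and the body_kind values are
names made up for illustration, and the real loop may well be structured
differently.

```c
#include <stddef.h>

/* Hypothetical result codes, not taken from the actual source. */
enum body_kind {
	BODY_ASCII,	/* only 7-bit bytes seen */
	BODY_8BIT,	/* high-bit bytes seen, may or may not be utf-8 */
	BODY_INVALID	/* 0xfe or 0xff seen, cannot be utf-8 */
};

/*
 * Hypothetical helper: classify a buffer in one pass.
 * This is not a utf-8 validator; it only rejects the two byte
 * values that can never appear anywhere in a utf-8 stream.
 */
static enum body_kind
classify(const unsigned char *buf, size_t len)
{
	enum body_kind kind = BODY_ASCII;
	size_t i;

	for (i = 0; i < len; i++) {
		if (buf[i] == 0xfe || buf[i] == 0xff)
			return BODY_INVALID;
		if (buf[i] > 0x7f)
			kind = BODY_8BIT;
	}
	return kind;
}
```

With that, the caller would emit no mime headers at all for BODY_INVALID,
and carry on as before for the other two cases.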
