Karsten Bräckelmann wrote:
>
>>> Maybe the devs can briefly explain how the charset is being determined.
>>> Or at least, where exactly in the code one could find it...
>>>
>
> Matt, also, I got a feeling, that logic is what the OP is actually
> about. He does not want to leave out what he wants to be scored on. But
> (positively) define it.
>
That much is easy. It's done by looking at various character-set tags or
encoding marks in the message. These explicitly specify which character
set to use when interpreting the text.

Re-quoting myself from 11/26 (and elaborating with more examples):

CHARSET_FARAWAY
  Underlying eval function: check_for_faraway_charset() in MIMEEval.pm
  Detects based on: character set in the MIME Content-Type: of the
  message header.
  Example (in a message header):
    Content-Type: text/plain; charset="iso-2022-jp"
  which specifies Japanese text for a single-part message.

MIME_CHARSET_FARAWAY
  Underlying eval function: check_for_mime('mime_faraway_charset') in
  MIMEEval.pm
  Detects based on: character set in the MIME Content-Type: of the
  message attachments.
  Example (in a MIME-section header):
    Content-Type: text/plain; charset="iso-2022-jp"
  which specifies Japanese text for this part of a multi-part message.

HTML_CHARSET_FARAWAY
  Underlying eval function: html_charset_faraway() in HTMLEval.pm
  Detects based on: character set in the Content-Type: of a meta
  http-equiv tag embedded in HTML.
  Example:
    <META http-equiv=Content-Type content="text/html; charset=iso-2022-jp">
  which specifies Japanese text for this HTML document.

CHARSET_FARAWAY_HEADER
  Underlying eval function: check_for_faraway_charset_in_headers()
  Detects based on: embedded character encoding marks in the Subject:
  and From: headers.
  You'd have to look at the raw message source to see it, but it's
  generally something like this somewhere in the header:
    =?GB2312?
  which indicates that encoded Simplified Chinese text follows.
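
If you want to see which of these tags a given message actually carries, here is a rough Python sketch that collects them from the same four places. To be clear, this is not the SpamAssassin code (that is Perl, in the eval functions named above); the function name find_charset_tags() and both regular expressions are made up for illustration. It only lists the charset tags it finds and makes no attempt to decide whether a charset counts as "faraway".

```python
import re
from email import message_from_string

# Hypothetical patterns, written for this sketch only.
# charset=... inside a <meta ...> tag in an HTML body.
META_CHARSET_RE = re.compile(
    r'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)
# The charset field of an encoded word such as =?GB2312?B?...?=
ENCODED_WORD_RE = re.compile(r'=\?([^?\s]+)\?[bq]\?', re.IGNORECASE)


def find_charset_tags(raw_message):
    """Return the charset tags found in the four places discussed above."""
    msg = message_from_string(raw_message)
    found = {"header": [], "mime_part": [],
             "html_meta": [], "encoded_header": []}

    # 1. charset in the Content-Type: of the message header.
    top = msg.get_content_charset()
    if top:
        found["header"].append(top)

    for part in msg.walk():
        # 2. charset in the Content-Type: of each MIME section.
        #    walk() yields the message itself first, so skip that one here.
        if part is not msg:
            charset = part.get_content_charset()
            if charset:
                found["mime_part"].append(charset)
        # 3. charset in a meta http-equiv tag inside an HTML body.
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True) or b""
            html = payload.decode("latin-1")
            found["html_meta"].extend(
                c.lower() for c in META_CHARSET_RE.findall(html))

    # 4. encoding marks in the Subject: and From: headers.
    for name in ("Subject", "From"):
        value = str(msg.get(name, ""))
        found["encoded_header"].extend(
            c.lower() for c in ENCODED_WORD_RE.findall(value))

    return found


if __name__ == "__main__":
    sample = "\n".join([
        "Subject: =?GB2312?B?suLK1A==?=",
        "From: sender@example.com",
        'Content-Type: text/plain; charset="iso-2022-jp"',
        "",
        "body text",
    ])
    print(find_charset_tags(sample))
```

Feed it the raw message source (headers and body as one string). All charsets are reported in lowercase so they are easy to compare.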