On Mon, 24 May 2004 09:27:45 +0000 Brian Candler <[EMAIL PROTECTED]> wrote:
> On Mon, May 24, 2004 at 12:28:53PM +0900, Hatuka*nezumi - IKEDA Soji wrote:
> > RFC2047 says quoted-printable "is designed to allow text
> > containing mostly ASCII characters to be decipherable on an ASCII
> > terminal without decoding". In general, a UTF-8 text doesn't
> > contain "mostly ASCII".
>
> I disagree there. In my opinion, in an ideal world, *everyone* would be
> sending text in UTF-8, regardless of language. Maybe one day we'll reach
> that point.
>
> Suppose you write English text and include a Japanese quotation within it,
> or vice versa? Both could be UTF-8. To me it doesn't make sense to force one
> encoding for UTF-8, nor to make the user choose (who almost certainly
> doesn't care).
>
> A simple algorithm here would be to encode using both, and see which comes
> out shorter.

OK. Applications with enough resources to implement Unicode support
can easily implement both encoding methods. Therefore, UTF-7/8 text
may be encoded with either encoding.

> base64 makes the message size increase by around 1/3. quoted-printable makes
> non-ASCII characters increase by a factor of 3.
>
> By my reckoning, if more than 11.1% of the characters need quoted-printable
> encoding, then base64 is shorter.
>
> For other character sets like ISO-8859-1 or ISO-2022-JP, then it may make
> sense to hard-code the choice of encoding, because the preference is mostly
> for the benefit of backwards-compatibility with non-MIME-compliant mailers

Understood. I was worried that a shortcut such as the `11.1%
algorithm', if applied to every charset, would break backward
compatibility (for example, for an ISO-8859-1 subject line that
contains rather many accented characters).

At least the charsets below allow both methods (although some
implementations prefer quoted-printable):

    RFC1947 (ISO-8859-7)   Greek
    RFC1555 (ISO-8859-8)   Hebrew
    TIS-620 (ISO-8859-11)  Thai

Perhaps the more recently introduced (sometimes non-Latin) charsets
allow both methods... is that right?
> > + For the same reason, I worry that some Latin-based MUAs would be
> >   able to handle only quoted-printable text parts.
>
> Spammers routinely base64-encode their mail to try and bypass filters, so I
> think most MUAs can handle it. And of course, they wouldn't be MIME
> compliant if they couldn't.

# Hmm... The Japanese spam I receive is often encoded with
# quoted-printable.

> > I think the best practice is to determine the encoding method by
> > fixed flags (recommended by each charset)
>
> However I'm not convinced that the selection of UTF-8 necessarily makes any
> declaration at all about the language it encodes or the subset of characters
> which are likely to be used within it.

I will not use UTF-8 for some time. ISO-2022-JP is the most portable
charset for Japanese messages at present. Similarly, I would use
EUC-KR for Korean messages, and so on.

> Just my 2c.
>
> Brian.

--- nezumi
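
The "encode using both, and see which comes out shorter" idea quoted above could look roughly like the following. This is only a minimal sketch in Python using the standard `base64` and `quopri` modules; the function name `pick_shorter_encoding` and the rule that ties go to quoted-printable are assumptions made for illustration, not anything proposed in this thread. It compares content-transfer-encodings of a text body only: the RFC2047 "Q" and "B" header encodings and line-folding overhead are not modelled.

```python
import base64
import quopri
from typing import Tuple


def pick_shorter_encoding(text: str, charset: str = "utf-8") -> Tuple[str, bytes]:
    """Encode `text` both ways and return (encoding name, encoded bytes).

    Hypothetical helper: whichever encoding yields fewer bytes wins.
    On a tie, quoted-printable is kept because it stays readable
    without decoding (an assumption of this sketch).
    """
    raw = text.encode(charset)

    # base64 grows the data by roughly one third.
    b64 = base64.b64encode(raw)

    # quoted-printable turns each byte that needs escaping into "=XX".
    qp = quopri.encodestring(raw)

    if len(b64) < len(qp):
        return "base64", b64
    return "quoted-printable", qp
```

Under these assumptions, a plain English string comes back as quoted-printable, while a string made up entirely of Japanese characters in UTF-8 comes back as base64. A fixed per-charset choice, as suggested for ISO-8859-1 or ISO-2022-JP, would simply bypass this function.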
