On Mon, 24 May 2004 09:27:45 +0000
Brian Candler <[EMAIL PROTECTED]> wrote:

> On Mon, May 24, 2004 at 12:28:53PM +0900, Hatuka*nezumi - IKEDA Soji wrote:
> > RFC2047 says quoted-printable "is designed to allow text 
> > containing mostly ASCII characters to be decipherable on an ASCII 
> > terminal without decoding".  In general, a UTF-8 text doesn't 
> > contain "mostly ASCII".
> 
> I disagree there. In my opinion, in an ideal world, *everyone* would be
> sending text in UTF-8, regardless of language. Maybe one day we'll reach
> that point.
> 
> Suppose you write English text and include a Japanese quotation within it,
> or vice versa? Both could be UTF-8. To me it doesn't make sense to force one
> encoding for UTF-8, nor to make the user choose (who almost certainly
> doesn't care).
> 
> A simple algorithm here would be to encode using both, and see which comes
> out shorter.

OK.  Applications with enough resources to implement Unicode 
support can easily implement both encoding methods.
Therefore, UTF-7/8 text may be encoded using either encoding.
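
A minimal sketch of the "encode using both, and see which comes out
shorter" idea, assuming Python's standard base64 and quopri modules.
The function name shorter_encoding is hypothetical, and the sketch
ignores RFC 2047 encoded-word overhead and base64 line folding:

```python
import base64
import quopri


def shorter_encoding(text, charset="utf-8"):
    """Encode `text` both ways and return (name, payload) of the shorter one."""
    raw = text.encode(charset)
    b64 = base64.b64encode(raw)
    qp = quopri.encodestring(raw)
    # On a tie, prefer quoted-printable: it stays readable without decoding.
    if len(qp) <= len(b64):
        return "quoted-printable", qp
    return "base64", b64


if __name__ == "__main__":
    for sample in ("Hello, world", "日本語のテキスト"):
        name, payload = shorter_encoding(sample)
        print(name, len(payload))
```

Mostly-ASCII text should come out as quoted-printable, and text made
up mainly of multi-byte characters as base64.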


> base64 makes the message size increase by around 1/3. quoted-printable makes
> non-ASCII characters increase by a factor of 3.
> 
> By my reckoning, if more than 11.1% of the characters need quoted-printable
> encoding, then base64 is shorter.
> 
> For other character sets like ISO-8859-1 or ISO-2022-JP, then it may make
> sense to hard-code the choice of encoding, because the preference is mostly
> for the benefit of backwards-compatibility with non-MIME-compliant mailers

I understand.  I was worried that shortcuts such as the `11.1% 
algorithm' would break backward compatibility if applied to every
charset (for example, an ISO-8859-1 subject line containing rather 
many accented characters).
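
As a rough check of the break-even point, here is the arithmetic 
under a deliberately simple model: base64 turns every 3 bytes into 4 
characters, and quoted-printable turns each escaped byte into 3 
characters ("=XX") and leaves the rest alone.  Line folding and 
encoded-word overhead are ignored; the names below are illustrative 
only.  Under these assumptions the threshold works out to 1/6 (about 
16.7%); a figure of 1/9 (11.1%) is what results if each escaped byte 
is counted as adding three characters rather than becoming three.

```python
from fractions import Fraction

BASE64_EXPANSION = Fraction(4, 3)  # 3 input bytes -> 4 output characters
QP_ESCAPED_SIZE = 3                # one escaped byte -> "=XX"


def qp_expansion(p):
    """Size factor of quoted-printable when a fraction p of bytes is escaped."""
    return (1 - p) + QP_ESCAPED_SIZE * p


# Solve (1 - p) + 3p = 4/3 for p.
break_even = (BASE64_EXPANSION - 1) / (QP_ESCAPED_SIZE - 1)

if __name__ == "__main__":
    print(break_even, float(break_even))
```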

At least the charsets below allow both methods (although some 
implementations prefer quoted-printable for encoding):
  RFC1947 (ISO-8859-7) Greek
  RFC1555 (ISO-8859-8) Hebrew
  TIS-620 (ISO-8859-11) Thai

Perhaps more recently introduced (sometimes non-Latin) charsets 
allow both methods... is that right?


> > + For the same reason, I worry that some Latin-based MUAs can 
> >   handle only quoted-printable text parts.
> 
> Spammers routinely base64-encode their mail to try and bypass filters, so I
> think most MUAs can handle it. And of course, they wouldn't be MIME
> compliant if they couldn't.

# Hmm... The Japanese spam I receive is often encoded with 
# quoted-printable.


> > I think the best practice is to determine the encoding method by 
> > fixed flags (recommended by each charset)
> 
> However I'm not convinced that the selection of UTF-8 necessarily makes any
> declaration at all about the language it encodes or the subset of characters
> which are likely to be used within it.

I will not be using UTF-8 for some time.  ISO-2022-JP is the most 
portable charset for Japanese messages at present.  Similarly, I 
would use EUC-KR for Korean messages, and so on.
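
The two positions in this thread could be combined roughly as 
follows: a fixed flag per charset where one is recommended, and the 
"shorter of the two" rule only for charsets without a flag.  This is 
only a sketch.  The table FIXED_ENCODING and the function 
choose_encoding are hypothetical, and the flag values are 
illustrative assumptions rather than anything taken from a standard:

```python
import base64
import quopri

# Hypothetical per-charset flags; the values are illustrative assumptions.
FIXED_ENCODING = {
    "iso-8859-1": "quoted-printable",
    "iso-2022-jp": "base64",
    "euc-kr": "base64",
}


def choose_encoding(charset, raw):
    """Return the fixed encoding for `charset`, or the shorter one if unset."""
    fixed = FIXED_ENCODING.get(charset.lower())
    if fixed is not None:
        return fixed
    # No flag for this charset: compare the two encoded lengths.
    qp_len = len(quopri.encodestring(raw))
    b64_len = len(base64.b64encode(raw))
    return "quoted-printable" if qp_len <= b64_len else "base64"
```

With such a table, a heavily accented ISO-8859-1 subject line would 
still be sent as quoted-printable, while UTF-8 text would be decided 
case by case.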


> Just my 2c.
> 
> Brian.
> 

  --- nezumi
