https://bugzilla.wikimedia.org/show_bug.cgi?id=19001

--- Comment #4 from Philippe Verdy <verd...@wanadoo.fr> 2010-10-10 12:21:33 UTC 
---
Note that non-ASCII bytes in a subject line that are not properly re-encoded
with a disambiguating transfer syntax such as Quoted-Printable or Base64
should today be assumed to be UTF-8 by default. But many legacy email user
agents do not make this assumption and simply assume their own local system
encoding. The result is mojibake: Cyrillic or Chinese text gets displayed as
if it were Windows-1252 or ISO-8859-1, or the reverse. With old email agents
the result is clearly unpredictable.
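
To make the failure concrete, here is a minimal Python sketch of what such an
agent does with an unlabeled UTF-8 subject. The sample text and the choice of
ISO-8859-1 as the "local" encoding are assumptions made only for illustration.

```python
# Illustration only: the subject text and the "local" charset are assumed.
subject = "Привет"                 # what the sender typed
raw = subject.encode("utf-8")      # bytes placed unlabeled in the header

# A legacy agent that assumes its local 8-bit charset decodes byte by byte,
# so every 2-byte UTF-8 sequence turns into two unrelated characters.
garbled = raw.decode("iso-8859-1")

# Nothing is lost yet: reversing the wrong guess recovers the original text.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(len(subject), len(raw), len(garbled))
print(recovered == subject)
```

The round trip only works because ISO-8859-1 maps every byte to a character;
with a charset that leaves some bytes undefined, the damage may not be
reversible.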

Unfortunately, the same old email agents (including the webmail of various
ISPs) frequently fail to decode Quoted-Printable and Base64 correctly as
well!

In all cases you get unpredictable mojibake with old user agents. It's time for
you to upgrade yours (or to change your webmail provider). I really think that
all modern email agents should be able to use UTF-8 as the default encoding of
MIME headers (including subject lines) for all incoming mail, should allow the
user to force another encoding (because guessing the encoding from a short
subject line works far less reliably than it does on email bodies and web
pages), and should also support the explicit Quoted-Printable and Base64
markup.
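
As a rough sketch of that receiving-side behavior, the Python function below
decodes explicitly labeled encoded words with the standard library's
`email.header.decode_header`, and treats unlabeled 8-bit bytes as UTF-8 unless
the user forces another charset. The names `decode_subject` and
`forced_charset` are hypothetical; this is not how any particular agent is
implemented.

```python
from email.header import decode_header
from typing import Optional


def decode_subject(raw: bytes, forced_charset: Optional[str] = None) -> str:
    """Decode raw Subject bytes (hypothetical helper, for illustration)."""
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        # Unlabeled 8-bit bytes: honor the user's override, else assume UTF-8.
        if forced_charset is not None:
            return raw.decode(forced_charset, errors="replace")
        return raw.decode("utf-8", errors="replace")

    # Pure ASCII: it may still carry labeled =?charset?B?...?= or ?Q? words.
    parts = []
    for chunk, charset in decode_header(text):
        if isinstance(chunk, bytes):
            try:
                parts.append(chunk.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Unknown charset label: fall back to UTF-8.
                parts.append(chunk.decode("utf-8", errors="replace"))
        else:
            parts.append(chunk)
    return "".join(parts)


print(decode_subject(b"=?utf-8?Q?Caf=C3=A9?="))
print(decode_subject(b"Caf\xe9", forced_charset="iso-8859-1"))
```

The override matters for the second call: the lone byte 0xE9 is not valid
UTF-8, so without `forced_charset` it would come out as a replacement
character.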

And in your case, where you receive many emails in Russian with Cyrillic (not
Latin) letters making up most characters of the subject line, Quoted-Printable
is a bad choice: MediaWiki should probably use Base64 instead (which will be
shorter), even if this still appears as mojibake for you. MediaWiki could test
the string to see which of Base64 or Quoted-Printable is shorter, and should
avoid multiple Quoted-Printable sections in the same subject line (when it
contains spaces or other ASCII characters between Cyrillic words).
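
The length test could look roughly like the Python sketch below. It is not
MediaWiki code: the function names are made up for illustration, the whole
subject is emitted as a single encoded word (so spaces between Cyrillic words
do not split it into several sections), and the 75-character limit that
RFC 2047 puts on each encoded word is ignored for brevity.

```python
import base64
from email.header import decode_header

# Bytes that the RFC 2047 "Q" encoding may leave unescaped in any header.
Q_SAFE = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!*+-/"
)


def q_encode(raw: bytes) -> str:
    """RFC 2047 "Q" payload: space becomes "_", unsafe bytes become =XX."""
    out = []
    for byte in raw:
        if byte in Q_SAFE:
            out.append(chr(byte))
        elif byte == 0x20:
            out.append("_")
        else:
            out.append(f"={byte:02X}")
    return "".join(out)


def shortest_encoded_word(text: str, charset: str = "utf-8") -> str:
    """Encode the whole subject once, with whichever syntax is shorter."""
    if text.isascii():
        return text  # nothing to encode
    raw = text.encode(charset)
    q = q_encode(raw)
    b = base64.b64encode(raw).decode("ascii")
    if len(b) < len(q):
        return f"=?{charset}?B?{b}?="
    return f"=?{charset}?Q?{q}?="


for sample in ("Привет мир", "Café menu"):
    word = shortest_encoded_word(sample)
    payload, label = decode_header(word)[0]
    print(word, payload.decode(label) == sample)
```

Mostly-Cyrillic text costs three characters per byte in Quoted-Printable
against four characters per three bytes in Base64, so Base64 wins there, while
mostly-Latin text with a few accented letters stays shorter (and readable) in
Quoted-Printable.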

Google Mail uses another strategy when sending emails: not only does it try
both transfer syntaxes, it also checks which characters are used so that it
can pick one of the common ISO-8859 or CJK encodings, and then re-encodes the
text with one of the two transfer syntaxes (if there are non-ASCII characters).

Google Mail uses various tricks to detect the target ISP in order to select an
encoding that its webmail will support and display properly, and it monitors
the emails received from people in your contacts, so that you reply to them
using the same encoding they used when sending emails to you (unfortunately,
this technique cannot be used by MediaWiki, which has no database recording
what the various ISPs in the world support, and no access to your web contact
list).

All that MediaWiki COULD do is include a preference in your user account to
specify an encoding that YOUR email agent can read, which would be used by
default if a subject line contains only characters from your preferred charset
(otherwise it would still fall back to UTF-8, using Base64 or Quoted-Printable,
also according to your preferences, or use a transliteration into your
preferred charset if that is possible without excessive loss).
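
A minimal Python sketch of such a preference follows. It is not an existing
MediaWiki option: `choose_charset`, `encode_subject` and the `preferred`
argument are hypothetical names, the header itself is built with the standard
library's `email.header.Header`, and the transliteration fallback is left out.

```python
from email.header import Header, decode_header, make_header


def choose_charset(subject: str, preferred: str) -> str:
    """Use the preferred charset if it can represent the subject, else UTF-8."""
    try:
        subject.encode(preferred)
    except (UnicodeEncodeError, LookupError):
        # Unencodable characters or an unknown charset name: fall back.
        return "utf-8"
    return preferred


def encode_subject(subject: str, preferred: str) -> str:
    """Return a Subject value labeled with the chosen charset."""
    if subject.isascii():
        return subject  # plain ASCII needs no encoded word
    return Header(subject, choose_charset(subject, preferred)).encode()


# Assumed preference for illustration: a user whose agent reads KOI8-R.
for text in ("Привет", "Привет, 世界", "Hello"):
    header = encode_subject(text, "koi8-r")
    round_trip = str(make_header(decode_header(header)))
    print(choose_charset(text, "koi8-r"), round_trip == text)
```

The second sample mixes Cyrillic with Chinese, which KOI8-R cannot represent,
so it takes the UTF-8 fallback.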

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
You are on the CC list for the bug.

