On 5th June 2007, Øistein E. Andersen wrote:
(To do this properly, what we really ought to do is look for
C1 and undefined characters in all IANA charsets and semi-official
mappings to Unicode and check 1) whether the gaps can be filled
by borrowing from other encodings, and 2) whether browsers
actually do so. [...])
I have finally got round to looking at superset encodings.
To do this, I started with Unicode mappings from [UNI] for 8-bit 1-byte
alphabet encodings and added mappings for other such encodings
implemented in Opera, Safari or Firefox, mostly from [CSETS], though
I made one for Windows-Sami-2 from a PDF. (I then discovered that IE
had something called Arabic-ASMO, for which no matching specification
could be found, and subsequently reverse-engineered all IE's encodings.
Most of these turned out to be identical to other mappings or only
add characters from the PUA, but some real differences were found,
and those are reported in the text below.)
[UNI] http://unicode.org/Public/MAPPINGS/
[CSETS] http://crl.nmsu.edu/~mleisher/csets.html
All the character repertoires and encoding vectors defined by the mappings
were then compared pairwise. (Codepoints mapped to C0, space, BS or C1
were treated as unassigned, and directionality indicators for Arabic and
Hebrew were ignored.) The result is quite a big and unreadable table
[FULL], so the repertoires and encodings were clustered, which gave rise to
the tables in [ENC], which compare charsets with fewer than 27 incompatible
codepoints, as well as those in [REP], which compare charsets with at most
60 characters not found in both repertoires. (The thresholds are arbitrary, but
more than sufficiently large to ensure that all related charsets will be
clustered together and at the same time sufficiently small to keep the
tables at a reasonable size.)
[FULL] http://coq.no/X/charset-table.html
[ENC] http://coq.no/X/charset-enc.html
[REP] http://coq.no/X/charset-rep.html
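The pairwise comparison described above can be approximated with a short Python sketch. It is not the script used to produce the tables: it relies on the codec tables bundled with Python instead of the mappings from [UNI] and [CSETS], the names encoding_vector and compare are made up for illustration, and only control characters and space are treated as unassigned (the `BS' exclusion and the handling of directionality indicators are not reproduced).

```python
import unicodedata


def encoding_vector(codec_name):
    """Return {byte: character} for every assigned position of a 1-byte codec.

    Assumption: positions decoding to a control character (C0/C1) or to
    plain space count as unassigned, as do undefined positions.
    """
    vector = {}
    for byte in range(256):
        try:
            char = bytes([byte]).decode(codec_name)
        except UnicodeDecodeError:
            continue  # undefined position in this codec
        if len(char) != 1:
            continue
        if char == " " or unicodedata.category(char) == "Cc":
            continue
        vector[byte] = char
    return vector


def compare(codec_a, codec_b):
    """Compare two charsets by encoding vector and by repertoire.

    Returns (incompatible, only_a, only_b):
      incompatible: bytes assigned in both codecs but to different characters
      only_a:       characters found in codec_a's repertoire only
      only_b:       characters found in codec_b's repertoire only
    """
    va = encoding_vector(codec_a)
    vb = encoding_vector(codec_b)
    incompatible = sorted(
        byte for byte in va.keys() & vb.keys() if va[byte] != vb[byte]
    )
    rep_a = set(va.values())
    rep_b = set(vb.values())
    return incompatible, rep_a - rep_b, rep_b - rep_a


if __name__ == "__main__":
    incompatible, only_a, only_b = compare("latin_1", "cp1252")
    print("incompatible codepoints:", [hex(b) for b in incompatible])
    print("characters not in both repertoires:", len(only_a) + len(only_b))
```

With counts like these, clustering reduces to keeping the pairs whose number of incompatible codepoints, or of characters missing from one repertoire, stays below the chosen threshold.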
A short summary of the most interesting/relevant results (supported by [ENC])
can be found below.
--
Øistein E. Andersen
PS: How should colour be added to tables like these in HTML5 with
neither of the attributes bgcolor and style?
PPS: Some right-to-left characters contaminate surrounding characters as I
have not yet found a simple solution to make everything strictly
left-to-right (probably because I have not looked for it properly).
Notation
x y: x is a proper subset of y
=
ASCII
=
Most of the charsets are ASCII-compatible; some are EBCDIC-based
(none of which are implemented in browsers, as far as I know).
The following are /almost/ ASCII-compatible:
CP864 uses Arabic per cent in place of the Latin sign.
JIS-201 replaces `reverse solidus' and `tilde' with `yen' and `macron'.
See below for PostScript / NextStep.
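A check for this kind of near-compatibility might look like the sketch below. It assumes that Python's bundled codec tables are a fair stand-in for the mappings discussed here, which need not hold for every charset; the function name ascii_deviations is hypothetical.

```python
def ascii_deviations(codec_name):
    """Return {byte: decoded} for bytes 0x00-0x7F that do not decode to
    the ASCII character with the same code.

    An undefined position is reported with the value None.
    """
    deviations = {}
    for byte in range(0x80):
        try:
            decoded = bytes([byte]).decode(codec_name)
        except UnicodeDecodeError:
            decoded = None
        if decoded != chr(byte):
            deviations[byte] = decoded
    return deviations


if __name__ == "__main__":
    # latin_1 is ASCII-compatible, cp037 is EBCDIC-based, cp864 is the
    # almost-compatible case mentioned above.
    for name in ("latin_1", "cp037", "cp864"):
        found = ascii_deviations(name)
        print(name, "deviates from ASCII at", len(found), "positions")
```

An empty result means ASCII-compatible, a handful of entries means almost compatible, and a large number of entries indicates an EBCDIC-style layout.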
==
Arabic, including MacArabic / MacFarsi
==
Both MacArabic and MacFarsi are close to being supersets of 8859-6.
The Macintosh encodings encode explicitly right-to-left characters `dollar',
`space' and `hyphen' in place of ISO's `generic currency sign',
`non-breaking space' and `soft hyphen'.
MS IE's so-called ASMO-708 (not treated as an 8859-6 alias as per IANA)
appears to be another rough superset of 8859-6, adding accented lowercase
letters for French and box-drawing characters, but apparently no soft hyphen
or non-breaking space.
MS IE also includes Arabic-DOS, which appears to be different from all
other encodings.
Note: Similarly, IE apparently handles CS-ISO-2022-JP as distinct from
ISO-2022-JP. This is something to keep in mind when looking at
multi-byte encodings.
==
Baltic Rim
==
Despite what Wikipedia says, 8859-13 and CP1257 are not actually compatible;
the latter puts `acute accent' and `high dot' where the former has
`left double quotation mark' and `right single quotation mark'.
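The two conflicting positions can be inspected directly. The sketch below assumes that they sit at bytes 0xB4 and 0xFF (taken from the published mapping tables, not stated above) and that Python's iso8859_13 and cp1257 codecs follow those tables.

```python
import unicodedata

# Byte positions assumed from the published mapping tables.
CONFLICTS = (0xB4, 0xFF)


def describe(byte, codec_name):
    """Return the Unicode name of the character a codec assigns to a byte."""
    return unicodedata.name(bytes([byte]).decode(codec_name))


if __name__ == "__main__":
    for byte in CONFLICTS:
        print(f"0x{byte:02X}: 8859-13 = {describe(byte, 'iso8859_13')}, "
              f"CP1257 = {describe(byte, 'cp1257')}")
```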
==
Cyrillic KOI
==
There are several KOI8-based encodings, all of which include the basic
modern Russian alphabet (except yo) in an ASCII-compatible sequence.
KOI8-unified is almost a superset of ISO-IR-111, but uppercase and
lowercase Ukrainian `Cyrillic g with upturn' replace `generic currency
sign' and `soft hyphen'.
IE's KOI-8-U is different as it includes short uppercase and lowercase
y instead of two box-drawing characters.
Comments: KOI8-RU (as opposed to KOI8-R and KOI8-U) is apparently obsolete
and best forgotten.
KOI8-unified shows all letters from any KOI8-based encoding
correctly. It therefore seems like the best choice
if distributional analysis indicates KOI8 of some description.
==
Georgian
==
GEO-STD-8 and GEO-PS are mostly compatible, except that the former has
`No' where the latter has `y acute'.
(GEO-STD-8 is supposedly supported by Firefox, but does not seem to work