There is a bug in

  http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT

that causes round-trip compatibility problems if this table is used to
convert EUC-JP into Unicode and back.

Suggested fix: Replace in JIS0208.TXT the line

  0x815F  0x2140  0x005C  # REVERSE SOLIDUS

with

  0x815F  0x2140  0xFF3C  # FULLWIDTH REVERSE SOLIDUS
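
A minimal sketch of how the replacement could be applied and checked, assuming the three-column layout visible in the lines above (Shift-JIS code, JIS X 0208 code, Unicode code point, then a "#" comment). The function names and the two extra sample rows are illustrative assumptions; they are not taken from any tooling shipped with the table.

```python
# Small illustrative excerpt; only the 0x2140 row is quoted from this message.
SAMPLE = """\
0x8140  0x2121  0x3000  # IDEOGRAPHIC SPACE
0x815F  0x2140  0x005C  # REVERSE SOLIDUS
0x889F  0x3021  0x4E9C  # <CJK>
"""

def parse(text):
    """Yield (sjis, jis, ucs) integer triples from mapping-table text."""
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop the trailing comment
        if len(fields) == 3:
            yield tuple(int(f, 16) for f in fields)

def ascii_range_entries(text):
    """Return (jis, ucs) pairs whose Unicode value lies in U+0000..U+007F."""
    return [(jis, ucs) for _, jis, ucs in parse(text) if ucs <= 0x7F]

def apply_fix(text):
    """Replace the 0x2140 -> U+005C row with the suggested U+FF3C row."""
    out = []
    for line in text.splitlines():
        if line.split()[:3] == ["0x815F", "0x2140", "0x005C"]:
            line = "0x815F  0x2140  0xFF3C  # FULLWIDTH REVERSE SOLIDUS"
        out.append(line)
    return "\n".join(out) + "\n"

print(ascii_range_entries(SAMPLE))             # the offending row
print(ascii_range_entries(apply_fix(SAMPLE)))  # empty once the fix is applied
```

Running the same scan over the full table would be one way to confirm that no JIS X 0208 position still maps into the ASCII range after the change.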

Problem description:

In the current table, the JIS X 0208 code position 0x2140 is the only one
that is mapped into the Basic Latin (ASCII) range U+0000..U+007F. The
widely used EUC-JP encoding supports the union of the disjoint
repertoires of ASCII and JIS X 0208. In EUC-JP, the ASCII backslash
(0x5c) and the JIS X 0208 fullwidth backslash (0xa1 0xc0) are two
distinct characters. They are represented by distinct byte sequences and
terminal emulators assign different width properties to them. It is
therefore essential that the JIS X 0208 fullwidth backslash is mapped to
the Unicode FULLWIDTH REVERSE SOLIDUS and not -- as is done currently --
to the ASCII backslash. Mapping 0x2140 to U+005C not only causes EUC-JP
round-trip and width headaches but also looks rather unsystematic and
out of place, as it is the only JIS X 0208 character mapped to ASCII.
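
The round-trip loss can be illustrated with a toy converter that knows only the two backslashes. This is a sketch under stated assumptions, not a real EUC-JP codec: the helper names are hypothetical, and the reverse table is assumed to keep the first byte sequence seen for each Unicode character (a real converter might resolve the collision differently, but it can keep only one).

```python
ASCII_BACKSLASH = b"\x5c"
JISX0208_BACKSLASH = b"\xa1\xc0"  # JIS X 0208 0x2140 with the high bit set on both bytes

def build_converter(jis_2140_target):
    """Return (decode, encode) tables for the two backslash byte sequences."""
    decode = {
        ASCII_BACKSLASH: "\u005c",
        JISX0208_BACKSLASH: jis_2140_target,
    }
    encode = {}
    for euc, ch in decode.items():
        # Assumption: on a collision the first mapping (ASCII) wins.
        encode.setdefault(ch, euc)
    return decode, encode

def round_trip(euc, decode, encode):
    """Convert an EUC-JP byte sequence to Unicode and back."""
    return encode[decode[euc]]

current = build_converter("\u005c")  # mapping as in the current table
fixed = build_converter("\uff3c")    # mapping with the suggested fix

print(round_trip(JISX0208_BACKSLASH, *current))  # b'\\'  (fullwidth form is lost)
print(round_trip(JISX0208_BACKSLASH, *fixed))    # b'\xa1\xc0'
```

With the current mapping, both byte sequences decode to U+005C, so the reverse table can return only one of them; with U+FF3C the two characters stay distinct in both directions.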

I have not been able to check what JIS X 0221-1995 says here, but I
hope that they haven't made the same mistake.

I do understand that JIS X 0201 lacks the two ASCII characters U+005C
REVERSE SOLIDUS and U+007E TILDE (it places U+00A5 YEN SIGN and U+203E
OVERLINE there instead), but this simply makes JIS X 0201 unsuitable for
use on POSIX platforms and cannot be an excuse for squeezing one (then
why not both?) of these two single-width characters into the JIS X 0208
mapping table.

If there really is a compelling reason for not fixing this mapping table
(version 0.9, 1994-03-08, "non-kanji mappings are provisional"), then
please at least add to

  http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT

a detailed description of this EUC-JP round-trip problem and a
justification for not solving it by fixing the mapping table to keep it
disjoint with ASCII. Thanks!

http://www.cl.cam.ac.uk/~mgk25/unicode.html#conv

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/