Re: [Tutor] close, but no cigar

Steven D'Aprano Tue, 23 Jul 2013 12:26:33 -0700

On 24/07/13 03:01, Marc Tompkins wrote:

On Tue, Jul 23, 2013 at 7:46 AM, Steven D'Aprano <st...@pearwood.info> wrote:

This is not quite as silly as saying that an English E, a German E and a
French E should be considered three distinct characters, but (in my
opinion) not far off it.


I half-agree, half-disagree.  It's true that the letter "E" is used
more-or-less the same in English, French, and German; after all, they all
use what's called the "Latin" alphabet, albeit with local variations.  On
the other hand, the Cyrillic alphabet contains several letters that are
visually identical to their Latin equivalents, but used quite differently -
so it's quite appropriate that they're considered different letters, and
even a different alphabet.

Correct. Even if they were the same, if legacy encoding systems treated them differently, so would
Unicode. For example, \N{DIGIT FOUR} and \N{FULLWIDTH DIGIT FOUR} have distinct code points, even
though they represent exactly the same digit, since some legacy East-Asian encodings had separate
characters for "full-width" and "half-width" forms.
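
You can check this from Python 3 with the standard unicodedata module. What follows is only
a minimal sketch: the sample pairs (Latin A against Cyrillic A, and the two digit fours) are
picked here for illustration, and the NFKC comparison is one possible way of asking whether
Unicode regards two characters as compatibility variants of each other.

```python
import unicodedata

# Look-alike pairs chosen for illustration: (plain character, similar-looking character)
pairs = [
    ("A", "\u0410"),  # Latin capital A vs Cyrillic capital A
    ("4", "\uff14"),  # DIGIT FOUR vs FULLWIDTH DIGIT FOUR
]

for plain, lookalike in pairs:
    # Show the code point and official name of each member of the pair
    print(f"U+{ord(plain):04X} {unicodedata.name(plain)}")
    print(f"U+{ord(lookalike):04X} {unicodedata.name(lookalike)}")
    # Distinct code points never compare equal as strings
    print("  equal as strings:", plain == lookalike)
    # NFKC folds compatibility variants (such as full-width forms) together
    folded = unicodedata.normalize("NFKC", lookalike)
    print("  equal after NFKC:", folded == plain)
```

Both pairs are unequal as strings. After NFKC normalization the full-width four becomes an
ordinary "4", while the Cyrillic letter stays Cyrillic, since it belongs to a different
alphabet rather than being a compatibility form.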

But I confess I have misled you. I wrote about the CJK controversy from memory,
and I'm afraid I got it completely backwards: the problem is that the glyphs
(images of the characters) are different, but not the meaning. Mea culpa.

For example, in English, we can draw the dollar sign $ in two distinct ways, with one
vertical line, or two. Unicode treats them as the same character (as do English
speakers). "Han Unification" refers to Unicode's choice to do the same for many
Han (Chinese, Korean, Japanese) ideographs with different appearance but the same
meaning. For various reasons, some technical, some social, this choice proved to be
unpopular, particularly in Japan. The controversy is nothing new, though, and Unicode still
supports about 71,000 distinct East Asian ideographs, which is *far* more than the old legacy
encodings were capable of representing. So if there is a Han character that you would like
to write which Unicode doesn't support, chances are that no other encoding system supports
it either.
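
As a rough illustration of what unification means in practice, here is a small sketch, again
using the standard unicodedata module. The ideograph U+76F4 is an example picked here (it is
often cited as one whose preferred shape differs between regions, but treat that as an
assumption); the point is only that Python, like Unicode, sees one code point, and the shape
you get on screen is decided by the font.

```python
import unicodedata

# The dollar sign: one code point, however many vertical strokes the font draws
dollar = "$"
dollar_name = unicodedata.name(dollar)
print(f"U+{ord(dollar):04X} {dollar_name}")

# A unified Han ideograph, chosen purely as an illustration
ideograph = "\u76f4"
ideograph_name = unicodedata.name(ideograph)
print(f"U+{ord(ideograph):04X} {ideograph_name}")

# Nothing in the string itself says "Chinese" or "Japanese";
# the text is the same single code point in either language
print(len(ideograph))
```

If you need a particular regional shape, that has to come from outside the string, for
example from the choice of font.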

More here:

https://en.wikipedia.org/wiki/Han_unification
http://www.unicode.org/faq/han_cjk.html
http://slashdot.org/story/01/06/06/0132203/why-unicode-will-work-on-the-internet

--
Steven
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
