Re: Diacritical marks: Single character or combined character?

Jukka K. Korpela Fri, 06 Dec 2013 07:01:58 -0800

2013-12-06 0:45, Shriramana Sharma wrote:

In Unicode the characters with precomposed diacritics are given
"canonical equivalences" to the corresponding sequences of base
characters followed by separate diacritics. So Unicode-compliant
parsing tools should not distinguish between the two.


There is no such requirement.

What the standard says, in clause 3.2, item C6, is: “A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct.”

So a program that sends data to another program should not expect that the recipient will treat U+0101 and U+0061 U+0304 as distinct. But it may do so, and (as the standard says in this context) it may have valid reasons to do so.

And the sending program may be based on specific information about the behavior of the recipient. Even though you should not assume a priori that “ā” and “ā” are treated as distinct, you may do so if you actually know that they will be.
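
To make the distinction concrete, here is a minimal sketch in Python, using only the standard unicodedata module. It shows that U+0101 and U+0061 U+0304 are different code point sequences that become identical after canonical normalization. The helper name canonically_equivalent is made up for this illustration; it is not something the standard defines.

```python
import unicodedata

precomposed = "\u0101"       # LATIN SMALL LETTER A WITH MACRON
decomposed = "\u0061\u0304"  # LATIN SMALL LETTER A + COMBINING MACRON


def canonically_equivalent(s: str, t: str) -> bool:
    """Hypothetical helper: compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


# A plain comparison looks at code points, so it treats the two as distinct.
raw_equal = precomposed == decomposed

# After normalization the two sequences compare equal.
equivalent = canonically_equivalent(precomposed, decomposed)

print("code points equal:     ", raw_equal)
print("canonically equivalent:", equivalent)
```

A recipient that compares raw code points treats the two spellings as distinct, which C6 permits. A sender that has no specific knowledge of the recipient can normalize its output first (for example to NFC), so that the question does not arise.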


Yucca

