2013-12-06 0:45, Shriramana Sharma wrote:

In Unicode the characters with precomposed diacritics are given
"canonical equivalences" to the corresponding sequences of base
characters followed by separate diacritics. So Unicode-compliant
parsing tools should not distinguish between the two.

There is no such requirement.

What the standard says, in clause 3.2, item C6, is: “A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct.”

So a program that sends data to another program should not expect that the recipient will treat U+0101 and U+0061 U+0304 as distinct. But the recipient may treat them as distinct, and (as the standard says in this context) it may have valid reasons to do so.

And the sending program may be based on specific information about the behavior of the recipient. Even though you should not assume a priori that “ā” (U+0101) and “ā” (U+0061 U+0304) are treated as distinct, you may do so if you actually know that they will be.
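
To make the distinction concrete, here is a minimal sketch in Python using only the standard unicodedata module. The helper name canonically_equivalent is my own, chosen for illustration; it is not something the standard defines. The sketch shows that a plain code point comparison does distinguish the two sequences, while a comparison after normalization does not:

```python
import unicodedata

# Two canonically equivalent spellings of "a with macron".
precomposed = "\u0101"   # U+0101 LATIN SMALL LETTER A WITH MACRON
decomposed = "a\u0304"   # U+0061 followed by U+0304 COMBINING MACRON


def canonically_equivalent(s: str, t: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


# A raw comparison works on code points, so it sees two different strings.
raw_equal = precomposed == decomposed

# A comparison after normalization treats the two spellings as the same text.
normalized_equal = canonically_equivalent(precomposed, decomposed)

print(raw_equal, normalized_equal)
```

A recipient that compares strings the first way is treating the sequences as distinct, which C6 does not forbid. A sender that cannot know which comparison the recipient uses is the one that must not rely on the difference.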

Yucca



