On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote:
> One of the general principles is that combining marks inherit the property of their base character.
>
> Normally, "inherited" should be the only property value for combining marks.
>
> There have been some deviations from this over the years, for various reasons, and there are some properties (such as general category) where it is necessary to recognize the character as combining, but the general principle still holds.
>
> Therefore, if you are trying to see whether a string is alphabetic, combining marks should be "transparent" to such an algorithm.

Generally, good advice. But there are clear exceptions. For example, the enclosing combining marks for symbols are intended (basically) to make symbols of a sort. And many combining marks have explicit script assignments, so they cannot simply willy-nilly inherit the script of a base letter, say, if they are misapplied.

This is why I recommend simply adding the Diacritic property into the mix for testing a string. That is a closer approximation to the kind of naive "Is this string alphabetic?" question that SunaraRaman was asking about: it picks up the correct subset of combining marks to union with the set of actual isAlphabetic characters, which produces more expected results. (That includes, of course, the correct classification of all the viramas, stackers, and killers, as well as picking up all the nuktas.)

Folks, please examine the sets of characters for Diacritic and for Extender in:

http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

to see what I'm talking about. The stuff you are looking for is already there.
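As a rough illustration of that union, here is a minimal Python sketch. It is not an official algorithm, and several pieces are assumptions: the helper names `load_property`, `in_ranges`, and `looks_alphabetic` are made up for this example, the three data lines are a tiny excerpt in PropList.txt syntax (check them against the real file, and load the whole file in practice), and Python's `str.isalpha()` is only a stand-in for the Alphabetic property. It tests for the letter general categories, so it misses Other_Alphabetic characters such as dependent vowel signs.

```python
import re

# Tiny excerpt in PropList.txt syntax, for illustration only.
# A real check should read the complete file from the UCD.
SAMPLE_PROPLIST = """\
0300..034E    ; Diacritic # Mn  [79] COMBINING GRAVE ACCENT..COMBINING UPWARDS ARROW BELOW
093C          ; Diacritic # Mn       DEVANAGARI SIGN NUKTA
094D          ; Diacritic # Mn       DEVANAGARI SIGN VIRAMA
"""

# Matches "XXXX ; Property" and "XXXX..YYYY ; Property" data lines.
_LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def load_property(text, name):
    """Return a list of (first, last) code point ranges for one property."""
    ranges = []
    for line in text.splitlines():
        m = _LINE.match(line)
        if m and m.group(3) == name:
            first = int(m.group(1), 16)
            last = int(m.group(2) or m.group(1), 16)
            ranges.append((first, last))
    return ranges


def in_ranges(ch, ranges):
    """True if the character's code point falls in any of the ranges."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in ranges)


def looks_alphabetic(s, diacritics):
    """Naive test: every character is a letter (str.isalpha, a stand-in
    for Alphabetic) or has the Diacritic property. Empty strings fail."""
    return bool(s) and all(
        ch.isalpha() or in_ranges(ch, diacritics) for ch in s
    )


DIACRITICS = load_property(SAMPLE_PROPLIST, "Diacritic")
```

With this in place, `looks_alphabetic("\u0915\u094D", DIACRITICS)` (KA followed by a virama) passes, whereas the same string fails when an empty list is passed for the diacritics, which is the difference the Diacritic property makes. The Extender property could be added to the union the same way by calling `load_property` with `"Extender"` on the full file.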

--Ken

P.S. And please don't start an argument about the fact that a "virama" isn't really a "diacritic". We know that, too. ;-)

