On Mon, 28 May 2018 22:02:15 -0700 Ken Whistler via Unicode <unicode@unicode.org> wrote:

> On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote:
> > One of the general principles is that combining marks inherit the
> > property of their base character.
> >
> > Normally, "inherited" should be the only property value for
> > combining marks.
> >
> > There have been some deviations from this over the years, for
> > various reasons, and there are some properties (such as general
> > category) where it is necessary to recognize the character as
> > combining, but the general principle still holds.
> >
> > Therefore, if you are trying to see whether a string is alphabetic,
> > combining marks should be "transparent" to such an algorithm.
>
> Generally, good advice. But there are clear exceptions. For example,
> the enclosing combining marks for symbols are intended (basically) to
> make symbols of a sort. And many combining marks have explicit script
> assignments, so they cannot simply willy-nilly inherit the script of a
> base letter if they are misapplied, for example.

How would one know that they are misapplied? And what if the author of
the text has broken your rules? Are such texts never to be transcribed
to pukka Unicode?

> This is why I recommend simply adding the Diacritic property into the
> mix for testing a string. That is a closer approximation to the kind
> of naive "Is this string alphabetic?" question that SunaraRaman was
> asking about -- it picks up the correct subset of combining marks to
> union with the set of actual isAlphabetic characters, to produce more
> expected results. (Including, of course, the correct classification
> of all the viramas, stackers, and killers, as well as picking up all
> the nuktas.)
>
> Folks, please examine the set of characters for Diacritic and for
> Extender in:
>
> http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
>
> to see what I'm talking about. The stuff you are looking for is
> already there.

Even without knowing exactly what is wanted, it looks to me as though
it isn't.
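
For concreteness, here is a rough Python sketch of the kind of test
being recommended, as I read it: accept a string if every character is
either a letter or has Diacritic or Extender. Several things in it are
my assumptions and not anything from the thread. The names
parse_proplist and is_alphabetic_ish are made up. PROPLIST_EXCERPT is
three lines copied by hand in PropList.txt format, not the full file.
Python's str.isalpha() tests general category L*, which only
approximates the Alphabetic property.

```python
import unicodedata

# Hand-copied excerpt in PropList.txt line format. This is an assumption
# for illustration only; a real test would read the complete file.
PROPLIST_EXCERPT = """\
0300..034E    ; Diacritic # Mn  [79] COMBINING GRAVE ACCENT..COMBINING UPWARDS ARROW BELOW
0BCD          ; Diacritic # Mn       TAMIL SIGN VIRAMA
0E46          ; Extender # Lm       THAI CHARACTER MAIYAMOK
"""


def parse_proplist(text, wanted):
    """Return the set of code points whose property name is in `wanted`."""
    code_points = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        rng, prop = (field.strip() for field in data.split(";"))
        if prop not in wanted:
            continue
        first, _, last = rng.partition("..")  # single code point or range
        code_points.update(range(int(first, 16), int(last or first, 16) + 1))
    return code_points


MARKS = parse_proplist(PROPLIST_EXCERPT, {"Diacritic", "Extender"})


def is_alphabetic_ish(s):
    """Naive test: every character is a letter (gc=L*), or has Diacritic
    or Extender according to the excerpt above."""
    return bool(s) and all(ch.isalpha() or ord(ch) in MARKS for ch in s)


if __name__ == "__main__":
    samples = {
        "e + acute": "e\u0301",
        "ka + pulli": "\u0b95\u0bcd",
        "ka + pulli + ZWNJ": "\u0b95\u0bcd\u200c",
    }
    for label, s in samples.items():
        cats = [unicodedata.category(ch) for ch in s]
        print(label, cats, is_alphabetic_ish(s))
```

Running it prints the general categories of each sample alongside the
verdict, which makes it easy to see which character causes a rejection.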

If he wants to allow <pulli, ZWNJ> as a substring, which he should,
then that fails because there is no overlap between \p{extender} and
\p{gc=Cf} or between \p{diacritic} and \p{gc=Cf}. U+034F COMBINING
GRAPHEME JOINER is also missing, apparently deliberately in the case of
'diacritic'.

If one uses the definition of words in the word break algorithm, one
will end up accepting combinations of letter plus enclosing circle or
keycap. (A fix to the word break algorithm for that would be
unpleasant.)

One hopes that the requirement doesn't include accepting all single
words. Every properly spelt word containing U+0E46 THAI CHARACTER
MAIYAMOK will be rejected, as it will contain a space before the
U+0E46. (I assume there are such words; certainly there are dictionary
entries with no corresponding entries without U+0E46, such as
"ตึ้ก ๆ".) At a lesser level, even English has a very few words with
spaces in them, and there is no solution but to list them.

Richard.