While treating unassigned code points as base characters (and the reasons for treating them this way) is logically sound, it would not be superfluous to formalize that, in my opinion.
Konstantin

2011/7/5 Mark Davis ☕ <[email protected]>

> Ah, you're right; I wasn't looking carefully enough at what you wrote.
>
> Yes, an unassigned code point (Cn) is treated as a base character.
>
> Unassigned code points are peculiar beasts, since we don't know really how
> they should behave until (and if) they are assigned. Their treatment by the
> Unicode algorithms varies based on some factors:
>
>    - safety - don't have them behave in a way that causes problems
>    - foresight - have them behave like the most likely candidate for
>      future assignment
>    - simplicity - since they shouldn't occur normally in text, don't spend
>      too much time worrying about them.
>
> These are not formalized principles, just my observations on how we've
> operated over the years.
>
> Mark
> *— Il meglio è l’inimico del bene —*
>
> On Mon, Jul 4, 2011 at 20:17, Karl Williamson <[email protected]> wrote:
>
>> On 07/03/2011 05:52 PM, Mark Davis ☕ wrote:
>>
>>> Mark
>>> /— Il meglio è l’inimico del bene —/
>>>
>>> On Sat, Jul 2, 2011 at 14:58, Karl Williamson <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>>     I have two questions about this.
>>>
>>>     1) In UAX #44, it says for information about the Grapheme_Base
>>>     property, to see UAX #29, but that document doesn't mention this
>>>     property.
>>>
>>> The documentation on Grapheme_Base in #44 is obsolete. Grapheme_Base has
>>> not been used in the specification of grapheme clusters since (I
>>> believe) Unicode 3.2.
>>>
>>>     2) The definition in UAX #29 for both legacy and extended grapheme
>>>     clusters effectively says that any Gc=Cn code points followed by any
>>>     number of grapheme_extend code points is a grapheme cluster. Is
>>>     that what is meant? I notice that Grapheme_Base excludes Cn code
>>>     points.
>>>
>>> It doesn't say that. If you had the sequence <Control Extend>, you'd
>>> have a break between them according to the following rule:
>>>
>>>     GB4. ( Control | CR | LF ) ÷
>>>
>>> It would result in two (degenerate) grapheme clusters.
>>>
>>> We need to fix the documentation to make this clearer. Could you let me
>>> know what let you to think that "any Gc=Cn code points followed by any
>>> number of grapheme_extend code points is a grapheme cluster" so that we
>>> can clarify that?
>>
>> It says that an extended grapheme cluster matches this:
>>
>>     ( CRLF
>>     | Prepend* ( Hangul-syllable | !Control )
>>       ( Grapheme_Extend | Spacing_Mark)*
>>     | . )
>>
>> That tells me that one option for a grapheme cluster is a !Control
>> followed by any number of Grapheme_Extends.
>>
>> Lower down it defines "Control" as
>>
>>     "General_Category = Line Separator (Zl), or
>>     General_Category = Paragraph Separator (Zp), or
>>     General_Category = Control (Cc), or
>>     General_Category = Format (Cf)
>>     and not U+000D CARRIAGE RETURN (CR)
>>     and not U+000A LINE FEED (LF)
>>     and not U+200C ZERO WIDTH NON-JOINER (ZWNJ)
>>     and not U+200D ZERO WIDTH JOINER (ZWJ)"
>>
>> By that definition of Control, all Gc=Cn code points are !Control.
>> Therefore a grapheme cluster can be a Cn followed by any number of
>> Grapheme_Extends
>>
>>> Grapheme_Extend, on the other hand, is exactly equivalent to
>>> Grapheme_Cluster_Break=Extend.
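
For anyone who wants to check Karl's reading mechanically, here is a minimal Python sketch of the "Control" definition exactly as quoted above. The helper names `is_control` and `first_unassigned` are mine, not from UAX #29. I am assuming the four "and not" clauses apply to the whole list of categories, and the results depend on the Unicode version shipped with your Python build. Other versions of UAX #29 may define Control differently, so treat this as an illustration of the quoted text only.

```python
import unicodedata

# Code points the quoted definition explicitly excludes from Control.
CR, LF, ZWNJ, ZWJ = 0x000D, 0x000A, 0x200C, 0x200D


def is_control(cp: int) -> bool:
    """Control per the definition quoted in this thread (hypothetical helper).

    Assumption: "(Zl or Zp or Cc or Cf) and not CR/LF/ZWNJ/ZWJ".
    """
    if cp in (CR, LF, ZWNJ, ZWJ):
        return False
    return unicodedata.category(chr(cp)) in ("Zl", "Zp", "Cc", "Cf")


def first_unassigned() -> int:
    """Return the lowest code point whose General_Category is Cn."""
    return next(
        cp for cp in range(0x110000)
        if unicodedata.category(chr(cp)) == "Cn"
    )


if __name__ == "__main__":
    cp = first_unassigned()
    # Under the quoted definition a Cn code point is never Control,
    # so it can start a cluster that absorbs following Extend characters.
    print(f"U+{cp:04X} Cn, is_control = {is_control(cp)}")
```

Since Cn is not among the four listed categories, `is_control` is false for every unassigned code point, which is the "!Control" branch Karl describes.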

