I wouldn't be averse to adding [:cn:][:cs:][:co:] to [:gcb:control:]. It would align it more closely with the current definition of Grapheme_Base.
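To make the affected code points concrete, here is a minimal sketch that uses Python's standard `unicodedata` module to check whether a code point has one of the three general categories mentioned above. The helper name `proposed_as_control` and the set `PROPOSED_CONTROL_CATEGORIES` are hypothetical, chosen for illustration only; this is not a grapheme cluster segmenter, and the result for unassigned code points depends on the Unicode version bundled with the Python build.

```python
import unicodedata

# Hypothetical tailoring: general categories the suggestion above would
# treat as Grapheme_Cluster_Break=Control (unassigned, surrogate, private use).
PROPOSED_CONTROL_CATEGORIES = {"Cn", "Cs", "Co"}


def proposed_as_control(code_point: int) -> bool:
    """Return True if the code point is unassigned, a surrogate, or private use."""
    return unicodedata.category(chr(code_point)) in PROPOSED_CONTROL_CATEGORIES


# Illustrative samples; the labels are informal descriptions, not character names.
samples = {
    0x0041: "Latin capital letter A",
    0xD800: "isolated high surrogate",
    0xE000: "private use",
    0x10FFFF: "noncharacter (permanently unassigned)",
}

for cp, label in samples.items():
    gc = unicodedata.category(chr(cp))
    print(f"U+{cp:04X} ({label}): gc={gc}, proposed control={proposed_as_control(cp)}")
```

An application that tailors its segmentation could consult a check like this before falling back to the default property values.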
As to how to handle private use characters, UAX #29 already allows overriding: "This specification defines *default* mechanisms; more sophisticated implementations can *and should* tailor them for particular locales or environments."

I'll file an agenda item for the August UTC meeting to consider this; you can also add your feedback to the UTC using the reporting form.

Mark
*— Il meglio è l’inimico del bene —*

On Tue, Jul 5, 2011 at 16:31, Karl Williamson <[email protected]> wrote:

> On 07/05/2011 09:29 AM, Mark Davis ☕ wrote:
>
>> Ah, you're right; I wasn't looking carefully enough at what you wrote.
>>
>> Yes, an unassigned code point (Cn) is treated as a base character.
>>
>> Unassigned code points are peculiar beasts, since we don't know really
>> how they should behave until (and if) they are assigned. Their treatment
>> by the Unicode algorithms varies based on some factors:
>>
>> * safety - don't have them behave in a way that causes problems
>> * foresight - have them behave like the most likely candidate for
>>   future assignment
>> * simplicity - since they shouldn't occur normally in text, don't
>>   spend too much time worrying about them.
>>
>> These are not formalized principles, just my observations on how we've
>> operated over the years.
>>
>> Mark
>> /— Il meglio è l’inimico del bene —/
>>
>
> Thanks for the answer. It does seem weird to me to treat them as base
> characters.
>
> But, I'm wondering then about Cs, isolated surrogates. They also are
> treated as base characters. That seems wrong to me. Since UTS18 is
> starting to mention the possibility of them in regexes, perhaps this should
> be addressed?
>
> Also, my understanding of UAX #44 is that private use code points may or
> may not be treated as base characters at the application's discretion. But
> this isn't mentioned in UAX#29.
>

