Re: Questions about UAX #29

Konstantin Ritt Tue, 05 Jul 2011 13:52:58 -0700

While treating unassigned code points as base characters (and the reasons
for treating them this way) is logically sound, it would not be
superfluous to formalize that, in my opinion.


Konstantin



2011/7/5 Mark Davis ☕ <[email protected]>

> Ah, you're right; I wasn't looking carefully enough at what you wrote.
>
> Yes, an unassigned code point (Cn) is treated as a base character.
>
> Unassigned code points are peculiar beasts, since we don't know really how
> they should behave until (and if) they are assigned. Their treatment by the
> Unicode algorithms varies based on some factors:
>
>    - safety - don't have them behave in a way that causes problems
>    - foresight - have them behave like the most likely candidate for
>    future assignment
>    - simplicity - since they shouldn't occur normally in text, don't spend
>    too much time worrying about them.
>
> These are not formalized principles, just my observations on how we've
> operated over the years.
>
> Mark
> *— Il meglio è l’inimico del bene —*
>
>
>
> On Mon, Jul 4, 2011 at 20:17, Karl Williamson <[email protected]> wrote:
>
>> On 07/03/2011 05:52 PM, Mark Davis ☕ wrote:
>>
>>>
>>>
>>> Mark
>>> /— Il meglio è l’inimico del bene —/
>>>
>>>
>>> On Sat, Jul 2, 2011 at 14:58, Karl Williamson <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>>    I have two questions about this.
>>>
>>>    1) In UAX #44, it says for information about the Grapheme_Base
>>>    property, to see UAX #29, but that document doesn't mention this
>>>    property.
>>>
>>>
>>> The documentation on Grapheme_Base in #44 is obsolete. Grapheme_Base has
>>> not been used in the specification of grapheme clusters since (I
>>> believe) Unicode 3.2.
>>>
>>>
>>>    2) The definition in UAX #29 for both legacy and extended grapheme
>>>    clusters effectively says that any Gc=Cn code points followed by any
>>>    number of grapheme_extend code points is a grapheme cluster.  Is
>>>    that what is meant?  I notice that Grapheme_Base excludes Cn code
>>>    points.
>>>
>>>
>>> It doesn't say that. If you had the sequence <Control Extend>, you'd
>>> have a break between them according to the following rule:
>>> GB4.    ( Control | CR | LF )   ÷
>>>
>>> It would result in two (degenerate) grapheme clusters.
>>>
>>> We need to fix the documentation to make this clearer. Could you let me
>>> know what led you to think that "any Gc=Cn code points followed by any
>>> number of grapheme_extend code points is a grapheme cluster" so that we
>>> can clarify that?
>>>
>>
>> It says that an extended grapheme cluster matches this:
>> ( CRLF
>> | Prepend* ( Hangul-syllable | !Control )
>>  ( Grapheme_Extend | Spacing_Mark)*
>> | . )
>>
>> That tells me that one option for a grapheme cluster is a !Control
>> followed by any number of Grapheme_Extends.
>>
>> Lower down it defines "Control" as
>> "General_Category = Line Separator (Zl), or
>> General_Category = Paragraph Separator (Zp), or
>> General_Category = Control (Cc), or
>> General_Category = Format (Cf)
>> and not U+000D CARRIAGE RETURN (CR)
>> and not U+000A LINE FEED (LF)
>> and not U+200C ZERO WIDTH NON-JOINER (ZWNJ)
>> and not U+200D ZERO WIDTH JOINER (ZWJ)"
>>
>> By that definition of Control, all Gc=Cn code points are !Control.
>> Therefore a grapheme cluster can be a Cn followed by any number of
>> Grapheme_Extends.
>>
>>
>>> Grapheme_Extend, on the other hand, is exactly equivalent to
>>> Grapheme_Cluster_Break=Extend.
>>>
>>>
>>
>
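
To make the point in the quoted exchange concrete, here is a minimal sketch in
Python. It is not the UAX #29 algorithm: it only encodes the "Control"
definition as quoted above, plus a simplified cluster rule (CRLF, or a
non-Control code point followed by Extend characters, or any single code
point). The helper names are hypothetical. Grapheme_Extend is approximated
here by General_Category Mn/Me, which is an assumption, since Python's
unicodedata module does not expose that property. Hangul syllables, Prepend
and Spacing_Mark are left out.

```python
import unicodedata


def is_control(ch: str) -> bool:
    """'Control' as quoted in the thread: Zl, Zp, Cc or Cf,
    excluding CR, LF, ZWNJ and ZWJ."""
    if ch in ("\r", "\n", "\u200c", "\u200d"):
        return False
    return unicodedata.category(ch) in {"Zl", "Zp", "Cc", "Cf"}


def is_extend_approx(ch: str) -> bool:
    """Assumption: approximate Grapheme_Extend by Gc=Mn or Gc=Me."""
    return unicodedata.category(ch) in {"Mn", "Me"}


def cluster_length(text: str, start: int) -> int:
    """Length of the (simplified) grapheme cluster beginning at `start`."""
    if text.startswith("\r\n", start):
        return 2  # CRLF stays together
    ch = text[start]
    end = start + 1
    # GB4: always break after Control, CR or LF, so nothing attaches to them.
    if not (is_control(ch) or ch in ("\r", "\n")):
        while end < len(text) and is_extend_approx(text[end]):
            end += 1
    return end - start


def first_unassigned() -> str:
    """First code point that this Python's Unicode data reports as Gc=Cn."""
    return next(
        chr(cp) for cp in range(0x110000)
        if unicodedata.category(chr(cp)) == "Cn"
    )


if __name__ == "__main__":
    cn = first_unassigned()
    print(f"U+{ord(cn):04X}", unicodedata.category(cn), is_control(cn))
    # Unassigned code point followed by two combining acute accents.
    print(cluster_length(cn + "\u0301\u0301", 0))
    # A real control character followed by a combining acute accent.
    print(cluster_length("\u0001\u0301", 0))
```

Under these assumptions, an unassigned code point is not "Control", so the
combining marks after it join its cluster, whereas a Cc character followed by
a combining mark is split by GB4 into two degenerate clusters, as Mark
describes.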
