On 09/11/2003 22:45, Philippe Verdy wrote:
From: "Peter Kirk" <[EMAIL PROTECTED]>
On 09/11/2003 14:55, Philippe Verdy wrote:
...
And canonical normalization _guarantees_ to preserve *only* "starter
sequences" (defective or not), but not necessarily "combining character
sequences" (defective or not), and that's where care must be taken when
encoding text...
Surely not. A combining character sequence consists of an optional base
character followed by one or more combining characters. Canonical
normalisation preserves the sequence of combining characters only,
although it may reorder this sequence. It also preserves without
reordering the juxtaposition of this sequence to the optional base
character. Therefore the combining character sequence is preserved.
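
As a rough illustration of the point above (a sketch using Python's
standard unicodedata module, not anything taken from this thread), the
example below normalises one combining character sequence to NFD. The
helper name split_base_and_marks is hypothetical, and it assumes its
input is a single sequence: an optional class 0 base followed only by
marks of class > 0. It shows one case, not a proof.

```python
import unicodedata

def split_base_and_marks(seq):
    """Split one combining character sequence into (base, marks).

    Assumption: `seq` is an optional class 0 base followed only by
    combining characters whose combining class is greater than 0.
    """
    if seq and unicodedata.combining(seq[0]) == 0:
        return seq[0], seq[1:]
    return "", seq

# q + COMBINING DOT ABOVE (class 230) + COMBINING DOT BELOW (class 220)
original = "q\u0307\u0323"
normalized = unicodedata.normalize("NFD", original)

base_before, marks_before = split_base_and_marks(original)
base_after, marks_after = split_base_and_marks(normalized)

# The base stays in front; the marks are the same set, possibly reordered.
print(base_before == base_after)
print(sorted(marks_before) == sorted(marks_after))
print([hex(ord(c)) for c in normalized])
```

Here the two marks swap places under NFD because 220 sorts before 230,
while the base keeps its position at the head of the sequence.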
That's where we differ:
The combining character sequence differs from what I define as a starter
sequence:
(1) by the fact it can contain more than one class 0 character (starter),
namely all class 0 combining characters (gc=Mn), and
(2) by the fact that a combining character sequence cannot contain some
class 0 characters (like unagreed PUAs controls and line/paragraph
separators which are treated individually, but not as a combining character
sequence).
The second difference is less critical for us (what it does is that it
creates occurrences of defective combining character sequences in the middle
of the text), but the first one is critical here...
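
To make point (1) concrete, here is a small sketch (again assuming
Python's unicodedata module; the function name
is_class0_combining_mark is made up for illustration) that tests
whether a character is a nonspacing mark (gc=Mn) that nevertheless has
canonical combining class 0, and so counts as a starter.

```python
import unicodedata

def is_class0_combining_mark(ch):
    """True if `ch` has general category Mn and combining class 0."""
    return unicodedata.category(ch) == "Mn" and unicodedata.combining(ch) == 0

# U+034F COMBINING GRAPHEME JOINER: gc=Mn, combining class 0
print(is_class0_combining_mark("\u034f"))
# U+0301 COMBINING ACUTE ACCENT: gc=Mn, combining class 230
print(is_class0_combining_mark("\u0301"))
```

A combining character sequence may contain such a character after its
base, which gives it two class 0 characters.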
This does not affect my argument. A combining character sequence, as
defined, does not perfectly fit your definition "an unordered set of
sequences of characters having the same combining class." But it is
preserved under canonical normalisation. Well, perhaps that depends what
you mean by "preserved". If you mean that its code point representation
is unchanged, that is not true of your starter sequences either. If it
means that its semantics are unchanged, it is true by definition of any
string of Unicode characters that its semantics are unchanged by
canonical normalisation, or indeed by any transformation into a
canonically equivalent form.
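
The two senses of "preserved" can be separated in a few lines. This is
only a sketch, assuming Python's unicodedata module; the helper
canonically_equivalent is a hypothetical name, and it relies on the
standard property that two strings are canonically equivalent exactly
when their NFD forms are identical.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Compare two strings by their NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

# Different code point representations...
print(precomposed == decomposed)
# ...but canonically equivalent.
print(canonically_equivalent(precomposed, decomposed))
```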
I still maintain that there's no terminology to designate what I call a
starter sequence.
Agreed. But does it matter? It does so only if this is a meaningful unit
within Unicode. On my understanding, a sequence of combining characters
all of class >0 is meaningful because this is what canonical reordering
operates on. But such a sequence does not necessarily form a unit with
the preceding character.
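
For illustration, the runs that canonical reordering operates on can be
picked out as below. This is a minimal sketch assuming Python's
unicodedata module; nonzero_class_runs is a hypothetical helper, not a
standard API. Each run is reported with its start index so that it is
visible when no character precedes it.

```python
import unicodedata

def nonzero_class_runs(text):
    """Return (start, run) for each maximal run of characters
    whose canonical combining class is greater than 0."""
    runs = []
    start = None
    for i, ch in enumerate(text):
        if unicodedata.combining(ch) > 0:
            if start is None:
                start = i          # a new run begins here
        elif start is not None:
            runs.append((start, text[start:i]))
            start = None
    if start is not None:          # run extends to the end of the text
        runs.append((start, text[start:]))
    return runs

# A run following a base character, and a run at the very start of a string.
print(nonzero_class_runs("a\u0301\u0323b"))
print(nonzero_class_runs("\u0301x"))
```

In the second call the run begins at index 0, so there is no preceding
character for it to form a unit with.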
--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/