On 02/08/2001 11:20:27 AM "J M Sykes" wrote:
>When a standard-conforming SQL-implementation concatenates two normalized
>UCS strings, then it is required that the result be normalized (noting
>Unicode Standard Annex #15 Unicode Normalization Forms, Concatenation).
Yes. It must be understood that a concatenated string is not guaranteed to
be normalised until it is explicitly normalised, regardless of the state of
the operand strings.
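To illustrate the point, here is a minimal Python sketch using the standard `unicodedata` module. The example strings and the helper name `is_nfc` are chosen for illustration only and do not come from the proposal. Both operands are in NFC, yet their concatenation is not.

```python
import unicodedata

def is_nfc(s: str) -> bool:
    # A string is in NFC exactly when normalizing it to NFC leaves it unchanged.
    return unicodedata.normalize("NFC", s) == s

left = "e"        # U+0065 LATIN SMALL LETTER E
right = "\u0301"  # U+0301 COMBINING ACUTE ACCENT, standing alone

joined = left + right                                # two code points: e + combining acute
renormalized = unicodedata.normalize("NFC", joined)  # one code point: U+00E9

print(is_nfc(left), is_nfc(right), is_nfc(joined))
print(len(joined), len(renormalized))
```

Each operand passes the NFC check on its own, but the joined string only does so after an explicit normalization step.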
>My question is, supposing the NF of the two operands to be different, what
>should be the NF of the result?
>
>In its present state, our proposal specifies the result by referring to the
>following table:
>
>Table A
>=======
>                 |Operand 2
>Operand 1        |NFKD  NFKC  NFD   NFC
>-----------------+------------------------
>NFKD             |NFKD  NFKC  NFD   NFC
>NFKC             |NFKC  NFKC  NFD   NFC
>NFD              |NFD   NFD   NFD   NFC
>NFC              |NFC   NFC   NFC   NFC
>
>It has been suggested that the following would be preferable:
>
>
>Table B
>=======
>                 |Operand 2
>Operand 1        |NFKD  NFKC  NFD   NFC
>-----------------+------------------------
>NFKD             |NFKD  NFKC  NFKD  NFKC
>NFKC             |NFKC  NFKC  NFKD  NFKC
>NFD              |NFKD  NFKD  NFD   NFC
>NFC              |NFKC  NFKC  NFC   NFC
I'm trying to make sense of these tables. Apparently, Table A consistently
applies a precedence of NFC > NFD > NFKC > NFKD. (I.e. the form for the
result should be the same as that of the operand with the highest form
according to this ordering.) Apparently, Table B gives precedence to K
forms (K > ~K) and to C over D (C > D), with the first ordering (K > ~K)
taking priority over the second (C > D).
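The Table A reading can be checked mechanically. The following Python sketch (the names `RANK_A` and `result_form_a` are made up for illustration) takes the single ordering NFC > NFD > NFKC > NFKD and regenerates the table from it.

```python
FORMS = ["NFKD", "NFKC", "NFD", "NFC"]

# Higher rank wins: NFC > NFD > NFKC > NFKD.
RANK_A = {"NFKD": 0, "NFKC": 1, "NFD": 2, "NFC": 3}

def result_form_a(op1: str, op2: str) -> str:
    # The result takes the form of whichever operand ranks higher.
    return max(op1, op2, key=RANK_A.__getitem__)

# One row per Operand 1, one column per Operand 2, in FORMS order.
table_a = {op1: [result_form_a(op1, op2) for op2 in FORMS] for op1 in FORMS}

for op1 in FORMS:
    print(f"{op1:<17}|" + "  ".join(f"{r:<4}" for r in table_a[op1]))
```

The generated rows agree with Table A as quoted, which supports the single-ordering description of that table.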
Actually, I don't think I'd go for either. Table B certainly raises a
concern: it gives precedence to the compatibility decompositions that occur
in NFKD and NFKC, which removes distinctions that, in certain situations,
might be important. Table B should be used only with caution.
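As a small illustration of the distinctions at stake, the Python sketch below compares NFC and NFKC on a few characters that have compatibility decompositions. The sample characters are chosen for illustration and are not taken from the proposal.

```python
import unicodedata

samples = {
    "fi ligature": "\ufb01",       # U+FB01 LATIN SMALL LIGATURE FI
    "superscript two": "x\u00b2",  # x followed by U+00B2 SUPERSCRIPT TWO
    "circled one": "\u2460",       # U+2460 CIRCLED DIGIT ONE
}

# Normalize every sample under the canonical and the compatibility composed form.
nfc = {name: unicodedata.normalize("NFC", s) for name, s in samples.items()}
nfkc = {name: unicodedata.normalize("NFKC", s) for name, s in samples.items()}

for name, s in samples.items():
    print(f"{name}: NFC keeps original = {nfc[name] == s}, NFKC = {nfkc[name]!r}")
```

NFC leaves all three samples untouched, while NFKC folds them to plain "fi", "x2" and "1", so the original distinction cannot be recovered afterwards.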
Both tables have an anomalous characteristic: if one operand is NFC, then
the result is always to be composed, but if one operand is NFKC and the
other is decomposed, then the result goes in two directions depending upon
the K or ~K property of the other operand. Why? That seems rather strange
to me. If the "Kompatibility" issue is orthogonal to the (de)composition
issue (which these tables follow, and which I think makes sense), then I
would think either C should always take precedence over D, or vice versa.
If we extract a portion from each table (simplifying, because the choice of
result form is commutative), we find
Sub-table A
=======
                |Operand 2
Operand 1       |NFKD  NFD
----------------+--------------
NFKC            |NFKC  NFD
Sub-table B
=======
                |Operand 2
Operand 1       |NFKD  NFD
----------------+--------------
NFKC            |NFKC  NFKD
Tables A and B could have just as readily had
Sub-table A.a
=======
                |Operand 2
Operand 1       |NFKD  NFD
----------------+--------------
NFKC            |NFKD  NFC
Sub-table B.a
=======
                |Operand 2
Operand 1       |NFKD  NFD
----------------+--------------
NFKC            |NFKD  NFKC
and I think that wouldn't have been any more or less motivated. It still
wouldn't make sense to me, though: I would have expected D to always have
precedence over C, as in Tables A.b and B.b:
Table A.b
=======
                 |Operand 2
Operand 1        |NFKD  NFKC  NFD   NFC
-----------------+------------------------
NFKD             |NFKD  NFKD  NFD   NFD
NFKC             |NFKD  NFKC  NFD   NFC
NFD              |NFD   NFD   NFD   NFD
NFC              |NFD   NFC   NFD   NFC
Table B.b
=======
                 |Operand 2
Operand 1        |NFKD  NFKC  NFD   NFC
-----------------+------------------------
NFKD             |NFKD  NFKD  NFKD  NFKD
NFKC             |NFKD  NFKC  NFKD  NFKC
NFD              |NFKD  NFKD  NFD   NFD
NFC              |NFKD  NFKC  NFD   NFC
or for C to always take precedence over D, as in Tables A.c and B.c:
Table A.c
=======
                 |Operand 2
Operand 1        |NFKD  NFKC  NFD   NFC
-----------------+------------------------
NFKD             |NFKD  NFKC  NFD   NFC
NFKC             |NFKC  NFKC  NFC   NFC
NFD              |NFD   NFC   NFD   NFC
NFC              |NFC   NFC   NFC   NFC
Table B.c
=======
                 |Operand 2
Operand 1        |NFKD  NFKC  NFD   NFC
-----------------+------------------------
NFKD             |NFKD  NFKC  NFKD  NFKC
NFKC             |NFKC  NFKC  NFKC  NFKC
NFD              |NFKD  NFKC  NFD   NFC
NFC              |NFKC  NFKC  NFC   NFC
(What a lot of alternatives!)
For the reason described above, I think compatibility decomposition should
be avoided if either operand did not use it (i.e. ~K > K). As for C vs. D,
I have a personal preference for D over C, but W3C has (with not invalid
reasons) chosen NFC as the preferred and recommended normalisation form in
any protocols that they create. As a result, I'd be inclined from all these
options to select Table A.c. It gives precedence to C over D, and it avoids
K unless both operands conform to K.
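The Table A.c rule is simple enough to state in code. This is only a sketch in Python under my own naming (`result_form_ac` and `concat_normalized` are hypothetical helpers, not anything defined by SQL or by the proposal): the result is a K form only if both operands are K forms, and it is composed if either operand is composed.

```python
import unicodedata

FORMS = ["NFKD", "NFKC", "NFD", "NFC"]

def result_form_ac(form1: str, form2: str) -> str:
    # K survives only when both operands use compatibility decomposition (~K > K).
    compat = "K" in form1 and "K" in form2
    # Composition wins when either operand is composed (C > D).
    composed = form1.endswith("C") or form2.endswith("C")
    return "NF" + ("K" if compat else "") + ("C" if composed else "D")

def concat_normalized(s1: str, form1: str, s2: str, form2: str):
    # Concatenate, then explicitly normalize to the form chosen by the rule.
    form = result_form_ac(form1, form2)
    return unicodedata.normalize(form, s1 + s2), form

# Regenerate the table: one row per Operand 1, columns in FORMS order.
table_ac = {op1: [result_form_ac(op1, op2) for op2 in FORMS] for op1 in FORMS}

text, form = concat_normalized("e", "NFC", "\u0301", "NFD")
print(form, [hex(ord(ch)) for ch in text])
```

The explicit `normalize` call after concatenation matters, since the joined string is not guaranteed to be in any normalization form merely because its operands were.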
- Peter
---------------------------------------------------------------------------
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>