I am very keen that SQL and XML Query move in the same direction,
based on the:
Character Model for the World Wide Web 1.0
http://www.w3.org/TR/charmod
which is, in turn, based on NFC.
Misha Wolf
W3C I18N WG Chair
On 09/02/2001 17:46:48 Mark Davis wrote:
> The whole principle of tagging individual strings with NF* is a bit odd to me;
> not sure I like it. The K forms in particular are really a folding operation,
> much like casing. I would not expect to find a model where someone tagged every
> string in a database with its Case, and then had some elaborate system in every
> function involving strings so that the result of any operation could be
> successfully tagged with Upper or Lower. Seems not well motivated.
>
> As for D vs C, I don't know that there is a huge advantage to tagging, vs. just
> picking one of them consistently all the time. And I can see many drawbacks in
> having to maintain the tags all the time, and handle mixed operations. There
> are advantages to the W3C approach; just always keep the data in one form.
>
>
> Given the model you have, however, I think Peter's A.c table is well thought
> out. There are implications for any string operation, not just concatenation.
> The operations with a single string (like uppercasing) are fairly
> straightforward: stay in the same form. Substringing too (which may require
> some fixup at the ends), upper/lowercasing, etc.
>
> With multiple strings in a function (not just two), you have to have a
> consistent output. Given the constraints you have, I think Peter's rules are
> good, and can be easily extended:
>
> 1. If all are K, retain the K in the output; otherwise don't.
> 2. If all are D, retain the D in the output; otherwise convert to C.
>
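A minimal Python sketch of the two rules above, extended to any number of operands. The function name `result_form` and the use of plain strings as form tags are assumptions made for illustration; nothing here comes from the SQL proposal itself.

```python
from typing import Iterable

VALID_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


def result_form(forms: Iterable[str]) -> str:
    """Return the form tag for the result of an operation on several strings.

    Rule 1: keep K only if every operand is a K form.
    Rule 2: keep D only if every operand is a D form; otherwise use C.
    """
    forms = list(forms)
    if not forms:
        raise ValueError("at least one operand form is required")
    unknown = [f for f in forms if f not in VALID_FORMS]
    if unknown:
        raise ValueError(f"unknown normalization form(s): {unknown}")

    keep_k = all("K" in f for f in forms)
    keep_d = all(f.endswith("D") for f in forms)
    return "NF" + ("K" if keep_k else "") + ("D" if keep_d else "C")


if __name__ == "__main__":
    print(result_form(["NFKD", "NFKC", "NFKD"]))
    print(result_form(["NFKC", "NFD"]))
```

For two operands this reproduces Table A.c, for example NFKC with NFD gives NFC. The tag only says which form the result should be brought to; the result still has to be normalized to that form.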
> However, there are very important exceptions; look at binary comparison of
> strings. You do not need to worry about differences in Ks because an NFKC
> string *is* NFC; an NFKD string *is* NFD; no conversion necessary. But to
> preserve transitivity you ALWAYS have to pick either a C or a D; no matter what
> the input. So you have to (logically at least) choose a single common form, C
> or D. That is, if C is the common form, then comparing an NF*D to an NF*D --
> even though they are the same form -- you *have* to map both to NF*C.
>
> For binary comparison, C is probably the best choice, since it matches more
> data and thus requires less processing. Even though D produces somewhat better
> results, binary comparison will simply not match user expectations anyway (it
> is more for internal structures, where you need *some* fast, consistent
> ordering but it does not need to be end-user oriented).
>
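A short Python sketch of binary comparison through a single common form, here C. The helper names `binary_key` and `binary_equal` are hypothetical, and the choice of UTF-8 bytes as the comparison key is an assumption; only `unicodedata.normalize` is a real library call.

```python
import unicodedata


def binary_key(s: str) -> bytes:
    """Map a string to its comparison key: always NFC, whatever form the
    input was tagged with. K differences are deliberately left alone."""
    return unicodedata.normalize("NFC", s).encode("utf-8")


def binary_equal(a: str, b: str) -> bool:
    return binary_key(a) == binary_key(b)


if __name__ == "__main__":
    decomposed = "e\u0301"   # e + combining acute accent (an NFD string)
    composed = "\u00e9"      # precomposed e-acute (an NFC string)
    print(binary_equal(decomposed, composed))
    # Compatibility characters are not folded by the C mapping.
    print(binary_equal("\ufb01", "fi"))
```

Because every operand goes through the same mapping, two NFD strings are also compared by their NFC images, which is what keeps the ordering consistent across mixed inputs.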
> Mark
>
> ----- Original Message -----
> From: <[EMAIL PROTECTED]>
> To: "Unicode List" <[EMAIL PROTECTED]>
> Sent: Thursday, February 08, 2001 14:32
> Subject: Re: The normalization form of the result of a dyadic operation.
>
>
> >
> > On 02/08/2001 11:20:27 AM "J M Sykes" wrote:
> >
> > >When a standard-conforming SQL-implementation concatenates two normalized
> > >UCS strings, then it is required that the result be normalized (noting
> > >Unicode Standard Annex #15 Unicode Normalization Forms, Concatenation).
> >
> > Yes. It must be understood that a concatenated string is not guaranteed to
> > be normalised until it is explicitly normalised, regardless of the state of
> > the operand strings.
> >
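To illustrate the point about concatenation, here is a small Python sketch in which both operands are normalized but their plain concatenation is not. The helper names `is_in_form` and `concat_normalized` are made up for this example.

```python
import unicodedata


def is_in_form(form: str, s: str) -> bool:
    """True if s is already in the given normalization form."""
    return unicodedata.normalize(form, s) == s


def concat_normalized(a: str, b: str, form: str) -> str:
    """Concatenate, then explicitly normalize the result."""
    return unicodedata.normalize(form, a + b)


if __name__ == "__main__":
    left, right = "e", "\u0301"          # base letter, combining acute accent
    print(is_in_form("NFC", left), is_in_form("NFC", right))
    print(is_in_form("NFC", left + right))
    print(concat_normalized(left, right, "NFC") == "\u00e9")
```

The same happens with decomposed forms: joining two NFD strings can leave combining marks out of canonical order at the boundary, so the result needs renormalizing there as well.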
> >
> >
> > >My question is, supposing the NF of the two operands to be different, what
> > >should be the NF of the result?
> > >
> > >In its present state, our proposal specifies the result by referring to the
> > >following table:
> > >
> > >Table A
> > >=======
> > >                  |Operand 2
> > > Operand 1        |NFKD  NFKC  NFD   NFC
> > > -----------------+------------------------
> > > NFKD             |NFKD  NFKC  NFD   NFC
> > > NFKC             |NFKC  NFKC  NFD   NFC
> > > NFD              |NFD   NFD   NFD   NFC
> > > NFC              |NFC   NFC   NFC   NFC
> > >
> > >It has been suggested that the following would be preferable:
> > >
> > >
> > >Table B
> > >=======
> > >                  |Operand 2
> > > Operand 1        |NFKD  NFKC  NFD   NFC
> > > -----------------+------------------------
> > > NFKD             |NFKD  NFKC  NFKD  NFKC
> > > NFKC             |NFKC  NFKC  NFKD  NFKC
> > > NFD              |NFKD  NFKD  NFD   NFC
> > > NFC              |NFKC  NFKC  NFC   NFC
> >
> >
> >
> >
> > I'm trying to make sense of these tables. Apparently, Table A consistently
> > applies a precedence of NFC > NFD > NFKC > NFKD. (I.e. the form for the
> > result should be the same as that of the operand with the highest form
> > according to this ordering.) Apparently, Table B gives a precedence to K
> > forms (K > ~K), and a precedence to C over D (C > D), but the first
> > ordering (K > ~K) is given higher priority over the second ordering (C >
> > D).
> >
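The reading of Table A as a single linear precedence can be checked mechanically. The following Python sketch regenerates the table from that ordering; the names `PRECEDENCE`, `table_a_result` and `render_table` are illustrative only.

```python
# Forms listed from lowest to highest precedence, following the reading
# NFC > NFD > NFKC > NFKD given above.
PRECEDENCE = ["NFKD", "NFKC", "NFD", "NFC"]


def table_a_result(op1: str, op2: str) -> str:
    """Result form is that of the operand ranked higher in PRECEDENCE."""
    return max(op1, op2, key=PRECEDENCE.index)


def render_table() -> str:
    """Lay the 4x4 result grid out in the same row/column order as Table A."""
    header = " " * 10 + "".join(f"{form:<6}" for form in PRECEDENCE)
    lines = [header.rstrip()]
    for row in PRECEDENCE:
        cells = "".join(f"{table_a_result(row, col):<6}" for col in PRECEDENCE)
        lines.append(f"{row:<10}{cells}".rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table())
```

Every cell produced this way agrees with Table A as quoted, which supports the linear-precedence reading of that table.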
> > Actually, I don't think I'd go for either. Certainly, table B has a
> > concern: precedence given to the compatibility decompositions that occur in
> > NFKD and NFKC -- this results in removing distinctions that, in certain
> > situations, might be important. Table B should only be used with caution.
> >
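As a concrete illustration of the distinctions that the K forms remove, the Python sketch below compares NFC and NFKC on a few compatibility characters. The sample strings and the helper name `changed_by_k` are chosen for this example only.

```python
import unicodedata


def changed_by_k(s: str) -> bool:
    """True if compatibility normalization alters s beyond what NFC does."""
    return unicodedata.normalize("NFKC", s) != unicodedata.normalize("NFC", s)


if __name__ == "__main__":
    samples = [
        "x\u00b2",   # x followed by SUPERSCRIPT TWO
        "\ufb01",    # LATIN SMALL LIGATURE FI
        "\u2460",    # CIRCLED DIGIT ONE
        "\u00e9",    # precomposed e-acute: no compatibility mapping involved
    ]
    for s in samples:
        nfkc = unicodedata.normalize("NFKC", s)
        print(repr(s), "->", repr(nfkc), "changed:", changed_by_k(s))
```

Once "x" plus superscript two has become "x2", the original cannot be recovered, which is the kind of loss a K-preferring table would introduce for operands that were not K to begin with.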
> > Both tables have an anomalous characteristic: if one operand is NFC, then
> > the result is always to be composed, but if one operand is NFKC and the
> > other is decomposed, then the result goes in two directions depending upon
> > the K or ~K property of the other operand. Why? That seems rather strange
> > to me. If the "Kompatibility" issue is orthogonal to the (de)composition
> > issue (which these tables follow, and which I think makes sense), then I
> > would think either C should always take precedence over D, or vice versa.
> > If we extract a portion from each table (and simplify because the operation
> > is commutative), we find
> >
> > Sub-table A
> > =======
> >                 |Operand 2
> > Operand 1       |NFKD  NFD
> > ----------------+--------------
> > NFKC            |NFKC  NFD
> >
> > Sub-table B
> > =======
> >                 |Operand 2
> > Operand 1       |NFKD  NFD
> > ----------------+--------------
> > NFKC            |NFKC  NFKD
> >
> >
> > Tables A and B could have just as readily had
> >
> > Sub-table A.a
> > =======
> >                 |Operand 2
> > Operand 1       |NFKD  NFD
> > ----------------+--------------
> > NFKC            |NFKD  NFC
> >
> > Sub-table B.a
> > =======
> >                 |Operand 2
> > Operand 1       |NFKD  NFD
> > ----------------+--------------
> > NFKC            |NFKD  NFKC
> >
> > and I think that wouldn't have been any more or less motivated. It still
> > wouldn't make sense to me, though: I would have expected D to always have
> > precedence over C, as in Tables A.b and B.b:
> >
> > Table A.b
> > =======
> >                  |Operand 2
> > Operand 1        |NFKD  NFKC  NFD   NFC
> > -----------------+------------------------
> > NFKD             |NFKD  NFKD  NFD   NFD
> > NFKC             |NFKD  NFKC  NFD   NFC
> > NFD              |NFD   NFD   NFD   NFD
> > NFC              |NFD   NFC   NFD   NFC
> >
> > Table B.b
> > =======
> >                  |Operand 2
> > Operand 1        |NFKD  NFKC  NFD   NFC
> > -----------------+------------------------
> > NFKD             |NFKD  NFKD  NFKD  NFKD
> > NFKC             |NFKD  NFKC  NFKD  NFKC
> > NFD              |NFKD  NFKD  NFD   NFD
> > NFC              |NFKD  NFKC  NFD   NFC
> >
> > or for C to always take precedence over D, as in Tables A.c and B.c:
> >
> > Table A.c
> > =======
> >                  |Operand 2
> > Operand 1        |NFKD  NFKC  NFD   NFC
> > -----------------+------------------------
> > NFKD             |NFKD  NFKC  NFD   NFC
> > NFKC             |NFKC  NFKC  NFC   NFC
> > NFD              |NFD   NFC   NFD   NFC
> > NFC              |NFC   NFC   NFC   NFC
> >
> > Table B.c
> > =======
> >                  |Operand 2
> > Operand 1        |NFKD  NFKC  NFD   NFC
> > -----------------+------------------------
> > NFKD             |NFKD  NFKC  NFKD  NFKC
> > NFKC             |NFKC  NFKC  NFKC  NFKC
> > NFD              |NFKD  NFKC  NFD   NFC
> > NFC              |NFKC  NFKC  NFC   NFC
> >
> >
> > (What a lot of alternatives!)
> >
> > For the reason described above, I think compatibility decomposition should
> > be avoided if either operand did not use it (i.e. ~K > K). As for C vs. D,
> > I have a personal preference for D over C, but W3C has (with not invalid
> > reasons) chosen NFC as the preferred and recommended normalisation form in
> > any protocols that they create. As a result, I'd be inclined from all these
> > options to select Table A.c. It gives precedence to C over D, and it avoids
> > K unless both operands conform to K.
> >
> >
> >
> > - Peter
> >
> >
> > ---------------------------------------------------------------------------
> > Peter Constable
> >
> > Non-Roman Script Initiative, SIL International
> > 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> > Tel: +1 972 708 7485
> > E-mail: <[EMAIL PROTECTED]>
> >