On 10/12/2003 18:42, Kenneth Whistler wrote:

...

And even then the word "interpretation" needs to be clearly defined; see below.



"Interpretation" has been *deliberately* left undefined. It falls back to its general English usage, because attempting a technical definition of "interpretation" in the context of the Unicode Standard runs too far afield from the intended area of standardization. The UTC would end up bogged down in linguistic and semiotic theory attempting to nail this one down.

What *is* clear is that a "distinction in interpretation of
a character or character sequence" cannot be confused, by
any careful reader of the standard, with "difference in
code point or code point sequence". The latter *is* defined
and totally unambiguous in the standard.



Thanks for the clarification. We are again talking at different levels. I am still looking from the point of view of an application programmer interested in a string as an abstract entity (an object or an abstract data type) with a meaning or interpretation, but with no interest in the exact encoding. You are looking at this at a lower level, either that of a systems programmer or that of an application programmer who is forced to get into this lower-level stuff because of inadequate system support at the more abstract level.


...

Well, then please correct your interpretation of interpretation.

<U+00E9> has one code point in it. It has one encoded character in it.

<U+0065, U+0301> has two code points in it. It has two encoded
characters in it.
The two sequences are distinct and distinguished and
distinguishable -- in terms of their code point or character
sequences.
The two sequences are canonically equivalent. They are not
*interpreted* differently, since they both *mean* the same
thing -- they are both interpreted as referring to the letter of
various Latin alphabets known as "e-acute".


*That* is what the Unicode Standard "means" by canonical equivalence.



Thanks again for the clarification. Again, I am not interested in code point sequences but in meaning. I have been forced to get involved in code point issues when I have found that they have not made the necessary meaning distinctions. But my interest is essentially at a higher level, which is why I am trying to push all of these non-meaningful distinctions into a low level hidden from my view.
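To make the distinction concrete, here is a minimal sketch using Python's standard unicodedata module (the variable names and the sketch itself are mine, purely for illustration):

import unicodedata

precomposed = "\u00E9"         # <U+00E9>: one code point, one encoded character
decomposed  = "\u0065\u0301"   # <U+0065, U+0301>: two code points

# Distinct and distinguishable as code point sequences...
print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False

# ...but canonically equivalent: both are interpreted as e-acute.
print(unicodedata.normalize("NFC", precomposed)
      == unicodedata.normalize("NFC", decomposed))   # True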

...

If you are operating at a level where the question "is this string
normalised" is meaningless, then you are talking about text
content and not about the level where the conformance requirements
of the Unicode Standard are relevant. No wonder you and others
are confused.

Of course, if I look on a printed page of text and see the word
"caf�" rendered there as a token, it is meaningless to talk about
whether the � is normalized or not. It just is a manifest token
of the letter �, rendered on the page. The whole concept of
Unicode normalization is irrelevant to a user at that level. But
you cannot infer from that that normalization distinctions cannot
be made conformantly in the encoded character stores for
digital representation of text -- which is the relevant field
where Unicode conformance issues apply.
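
To put that in concrete terms, a small Python sketch (I am assuming Python 3.8 or later for unicodedata.is_normalized; the example strings are mine):

import unicodedata

# Whether a string "is normalised" is a property of its encoded code
# point sequence, not of the rendered token on the page.
print(unicodedata.is_normalized("NFC", "caf\u00E9"))    # True
print(unicodedata.is_normalized("NFC", "cafe\u0301"))   # False

# Both render identically as "café"; the question is only meaningful
# for the underlying encoded character store.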



Ken, now you seem to be trying to define out of existence a level at which C7-C9 and probably also C10 (at least the part about canonical-equivalent sequences) are relevant. I accept, because of your explanation above, that there is a lower level at which they are not relevant, because it is concerned with encoded character sequences and not with interpretation. But above that level there is surely a separate level at which interpretation is relevant, and that is not just the level of printed texts outside a computer system. If there isn't such a level, C7-C10 are redundant and meaningless.

At the level I have in mind all kinds of important processes take place within a computer system. Some of these are defined by Unicode, e.g. collation, which is independent of which canonically equivalent form is used because it starts with normalisation. Others, e.g. automatic translation, are not defined by Unicode. For all processing at this level "Ideally, an implementation would always interpret two canonical-equivalent character sequences identically" (quote from C9). Rendering is also effectively at this level. And at this level the question "is this string normalised?" is meaningless, because we are looking at the text content and its interpretation, and not at the encoded form. There is of course an encoded form lying behind that text content, but that should be no more the concern of the end user than the UTF form or the pattern of on and off transistors or magnetic particles in the computer's memory, and it should be hidden from the end user by an API.
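
As a rough sketch of what I mean by hiding the encoded form behind an API (the class name and design are mine, purely illustrative; a real system would of course need far more than this):

import unicodedata

class Text:
    """A string compared and hashed by interpretation, so that
    canonically equivalent encoded forms are indistinguishable."""
    def __init__(self, s: str):
        self._nfc = unicodedata.normalize("NFC", s)
    def __eq__(self, other):
        return isinstance(other, Text) and self._nfc == other._nfc
    def __hash__(self):
        return hash(self._nfc)
    def __str__(self):
        return self._nfc

print(Text("caf\u00E9") == Text("cafe\u0301"))   # True: same interpretation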

...

Standards are not adjudicated by case law. They are not
interpreted by judges. ...

Surely in principle they could be, if there were, for example, a dispute over fulfilment of a contract which specified that a product must conform to Unicode. But this is a red herring here, I realise.

...

Well, I had stated such things more tentatively to start with, asking for contrary views and interpretations, but received none until now except for Mark's very generalised implication that I had said something wrong (and, incorrectly, that I hadn't read the relevant part of the standard). Please, those of you who do know what is correct, keep us on the right path. Otherwise the confusion will spread.



I'll try. :-)


Thank you, and thank you for giving your time to this issue.

--Ken









--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




