On 08/11/2004 12:47, Michael Everson wrote:

> ... Perhaps Ken Whistler and I, in our abundant spare time, might try to wordsmith the standard with regard to this issue. But your insistence that some legalistic interpretation of that text will determine what is and what is not a script is tiresome.

As my spare time may be more abundant than Ken's and yours, I have drafted the following and submitted it to http://www.unicode.org/reporting.html as an Error Report:

Subject: Characters, Scripts and Semantic Distinctions

According to the Unicode Standard 4.0, Section 2.2, sub-section "Characters, Not Glyphs", p.15, "Characters are the abstract representations of the smallest components of written language that have semantic value." However (as Michael Everson agrees with me) the distinction between corresponding letters in different scripts is not properly described as "semantic". It is therefore possible to understand this sub-section as implying that this distinction between letters should be treated in Unicode as a glyph distinction rather than a character distinction. This is of course a misunderstanding, because Unicode does in fact encode corresponding letters in different scripts as distinct characters. But this misunderstanding has become widespread and has fuelled a long and acrimonious debate about the proposed Phoenician script. Therefore, to ensure consistency and minimise misunderstandings, the text of this sub-section should be amended to make it clear that corresponding letters in different scripts are considered distinct characters.

I note that the issue is mentioned in passing in a different context on p.19, relating only to cases where there is no graphical distinction between scripts. But a clearer statement in the correct context would be much preferable.

I propose the following text to be added to p.15, after the sentence "They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation.":

"The letters used in natural language text are grouped into scripts, sets of letters which are used together in writing any one language. Letters in different scripts, even when they correspond either semantically or graphically, are represented in Unicode by distinct characters."

I note that this change also impacts a few special cases such as the use of the Latin letters Q and W in Cyrillic script for the Kurdish language. According to the principle clarified here, distinct Cyrillic Q and W characters should be encoded for Kurdish.

I would also suggest a separate definition of "script", a concept which is much used in Chapter 2 of the Standard but nowhere clearly defined. This definition should include a statement of the criteria by which Unicode distinguishes script differences, e.g. between Indic scripts, from graphical differences, e.g. between regular Latin, italic style and Fraktur. The lack of stated criteria for this has also contributed to serious misunderstandings concerning Phoenician.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/