On 2011-07-11 21:57, Ken Whistler wrote:

> On 7/10/2011 4:58 PM, Ernest van den Boogaard wrote:
>> For the long term, I suggest Unicode should aim for this:
>>
>> Unicode 6.5 should claim: There will be a *Unicode dictionary*,
>> limiting and reducing ambiguous semantics within Unicode
>> (Background: e.g. the word "character" will have one single crisp
>> definition, /or/ can be specified to & at any special point).
>
> That kind of terminological purity isn't going to occur.

That's possible, even probable, if people who could do the clarification don't want to do it.

> The word "character" has been
used ambiguously for decades in the IT industry, and has other general
language usage as well.

So have many other words. Terminology isn't about changing the meanings of words in everyday language. It's about defining terms, perhaps using common-language words but assigning technical meanings to them.

> The Unicode Consortium has a glossary of terms:
>
> http://www.unicode.org/glossary/

Yes, and it's mostly useful and well-written. But the "definition" of "character" is really a mess. For example, "(1) The smallest component of written language that has semantic value" doesn't make sense. What is the semantic value of the letter "e"? And does that definition answer the question whether "é" is one character or two?
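
To make the ambiguity concrete, here is a minimal Python sketch (the code points are just illustrative examples of mine, not anything taken from the glossary): the same visible "é" can be one code point or two, so a definition of "character" that says nothing about combining sequences leaves the question open.

    import unicodedata

    precomposed = "\u00E9"        # é as one code point, U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed = "\u0065\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

    print(len(precomposed), len(decomposed))    # 1 2 -- one code point vs. two
    print(precomposed == decomposed)            # False -- distinct code point sequences
    # Normalization (NFC) maps the two-code-point sequence to the precomposed form:
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True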

"Abstract character" is even worse. "A unit of information used for the organization, control, or representation of textual data." So a bit is a character, isn't it?

> But it is basically hopeless to try to legislate away linguistic
> ambiguity in a term like "character".

Here you're referring to "character" not as a term but as a word in English.

I think part of the problem is that Unicode has widely been misrepresented as providing a unique number (code point) for every character (see e.g. http://www.unicode.org/standard/WhatIsUnicode.html ), and it is difficult to take back such statements - which are an important part of Unicode evangelism. We can keep saying it only if the word "character" is used loosely enough. The statement is effectively a truism: Unicode has a unique number for every code point designated as a character code point (and for other code points, too, of course).
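
As a rough sketch of that distinction (Python again; the sample code points are arbitrary but stable examples I picked), every code point has a unique number, yet its general category tells you whether it is designated as a character at all:

    import unicodedata

    samples = {
        "U+0041": "\u0041",   # LATIN CAPITAL LETTER A -- an assigned character (category Lu)
        "U+FFFE": "\uFFFE",   # a noncharacter code point (category Cn)
        "U+D800": "\ud800",   # a surrogate code point (category Cs), never a character
    }
    for label, cp in samples.items():
        print(label, unicodedata.category(cp))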

--
Yucca, http://www.cs.tut.fi/~jkorpela/
