12.07.2011 19:57, Asmus Freytag wrote:

Jukka,

reminding everyone of the definition of "technical term" as opposed to a
word in everyday language isn't helping address the underlying issue.
Everyone is familiar with this distinction.

I’m afraid the distinction is not widely enough known, and even people who know it often fail to apply it. Most people I know use the word “term” (or its equivalent in their language) to denote just about any word, usually with an overtone that suggests some “tech stuff.”

The truism goes like this: "A character is what character encodings encode".

That’s not a very exact formulation, but it is a good start. Unicode has the concept of a code point, together with a classification of code points under which some of them are classified as characters (or as denoting characters). The concept of “character” in that sense is essential; it is the most important “character” concept in Unicode. So in good terminology, a single term is assigned to that concept, and the term consistently means just that concept and nothing else.
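To make that classification concrete, here is a minimal Python sketch (assuming Python 3 and its standard unicodedata module; the sample code points are just illustrative). Every value from U+0000 to U+10FFFF is a code point, but only some code points are assigned characters; others are surrogates, private-use code points, noncharacters, or simply unassigned.

import unicodedata

# General_Category gives a rough classification of code points:
# Lu/Ll = letters, Cs = surrogate, Co = private use, Cn = unassigned or noncharacter.
samples = [
    ("U+0041", "\u0041"),    # LATIN CAPITAL LETTER A: an assigned character (Lu)
    ("U+00E9", "\u00e9"),    # LATIN SMALL LETTER E WITH ACUTE (Ll)
    ("U+D800", "\ud800"),    # a surrogate code point, not a character (Cs)
    ("U+E000", "\ue000"),    # a private-use code point (Co)
    ("U+0378", "\u0378"),    # an unassigned code point (Cn)
    ("U+FFFE", "\ufffe"),    # a noncharacter code point (Cn)
]
for label, cp in samples:
    print(label, unicodedata.category(cp))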

The trouble starts with the observation that the concept does not fully correspond to the age-old concept of “character,” which predates Unicode and computers by thousands of years. Such problems are not rare in the modern world. You can solve them either by using a common word as a technical term, as long as you continuously keep it clear whether you are using it as a common word or as a term, or by coining a new word, phrase, or abbreviation.

The Unicode Standard mostly uses “character” as the technical term, but it makes frequent use of “character” as a common word, too, though usually prefixing it with the adjective “abstract” (as if Unicode characters weren’t abstract!).

Historically, character encodings have also encoded, on otherwise equal
footing, units that are intended for device control. Over time, some of
the device control characters have been redefined as indicators of
logical division of text. (TAB and LF are the most prominent examples of
this evolution).

Besides, space “characters” might not be seen as characters in the common-language sense. They are somewhere between “graphic characters” and “control characters.”
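To illustrate that in-between status with the same kind of Python sketch (the categories come from the Unicode Character Database): SPACE and NO-BREAK SPACE sit in a separator category of their own (Zs), while TAB and LINE FEED are in the control category (Cc), and ordinary letters are in the letter categories.

import unicodedata

# Zs = separator (space), Cc = control, Lu = uppercase letter.
for label, ch in [("SPACE", " "),
                  ("NO-BREAK SPACE", "\u00a0"),
                  ("TAB", "\t"),
                  ("LINE FEED", "\n"),
                  ("LATIN CAPITAL LETTER A", "A")]:
    print(label, unicodedata.category(ch))
# Prints Zs, Zs, Cc, Cc, Lu: spaces are classified neither with the
# ordinary graphic characters nor with the Cc controls.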

This is part of the complexity of the correspondence between Unicode characters and text characters (i.e., characters in the old everyday sense):

1) Some Unicode characters are not text characters but, e.g., formatting controls.
2) Some text characters cannot be represented as Unicode characters at all, except as Private Use characters.
3) Some text characters need to be represented as a sequence of two or more Unicode characters (or as Private Use characters).
4) Many text characters have alternative representations as Unicode characters (see the sketch after this list).
5) It is often not self-evident at all how a character used in text could or should be represented using Unicode characters, and many notes in the Unicode Standard are meant to clarify such things.
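To sketch points 3 and 4 in Python (standard library only; the examples are mine, not taken from the Glossary): the text character “é” has both a one-code-point and a two-code-point representation, while “g” with a diaeresis has no one-code-point representation at all.

import unicodedata

precomposed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # U+0065 + U+0301 COMBINING ACUTE ACCENT

# Point 4: two alternative Unicode representations of the same text character;
# normalization maps one to the other.
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Point 3: a text character that can only be a sequence of Unicode characters.
g_diaeresis = "g\u0308"     # U+0067 + U+0308 COMBINING DIAERESIS
print(len(g_diaeresis))     # 2 Unicode characters, one text character
print(unicodedata.normalize("NFC", g_diaeresis))  # unchanged: no precomposed form exists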

These historical developments have left us with this and other examples
of deep ambiguities in the definition of the members of those sets we
call "character encodings".

Ambiguities may exist, but this is basically a matter of distinguishing two concepts from each other.

Let's look at the putative benefit of a better definition. I think such
a benefit has implicitly been claimed to exist, but I would ask for a
demonstration in this case.

For one thing, defining “Unicode character” as a technical term and using it consistently makes it possible to state clearly how it relates to “character” in the common meaning, thereby helping people to understand and use Unicode better.

One possible benefit of a solid definition of the members of a set is in
helping decide which additional entities should be made members of the
set.

That’s a completely different issue. The purpose of definitions and consistent use of terms is not to set guidelines for decisions. It must be possible to say that a particular text character is not a Unicode character without implying (as a kind of naturalistic fallacy) that it should be.

The entire “definition” of the word “character” in the Unicode Glossary is highly confusing, and so is the definition of “abstract character.” They would perhaps best be replaced by the following:

Unicode character. A Unicode code point that is classified as a character code point. It may represent a text character, a component of a text character (such as an accent mark), or a control code for text formatting.

Text character. An element of writing recognized as a basic unit of text, such as a letter, a digit, a punctuation mark, a currency symbol, a syllable symbol in syllabic writing, or an ideograph. This is a non-technical definition, and there are differences in how people mentally divide text into text characters or recognize different graphic symbols as forms of one text character or as separate text characters. A text character is usually representable as a Unicode character or as a sequence of Unicode characters.

Character. A Unicode character or a text character. Normally the context makes it clear which one is meant. In the Unicode Standard, “character” normally means “Unicode character.”

(I’m sure this would need clarifications and tuning. I presented it mainly to illustrate that clarity is possible.)
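As a final illustration of why the distinction matters in practice, a crude Python sketch (it merely skips combining marks, so it only approximates text characters; a proper count would use grapheme clusters as defined in UAX #29): the answer to “how many characters?” depends on which concept you mean.

import unicodedata

text = "cafe\u0301"   # "café" written with a decomposed é

# Count of Unicode characters (code points):
print(len(text))                                                 # 5

# Rough count of text characters: skip combining marks.
print(sum(1 for ch in text if unicodedata.combining(ch) == 0))   # 4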

--
Yucca, http://www.cs.tut.fi/~jkorpela/
