12.07.2011 19:57, Asmus Freytag wrote:

Jukka,

reminding everyone of the definition of "technical term" as opposed to a
word in everyday language isn't helping address the underlying issue.
Everyone is familiar with this distinction.

I’m afraid the distinction is not widely enough known, and even people who know it often fail to apply it. Most people I know use the word “term” (or its equivalent in their language) to denote just about any word, usually with an overtone that suggests some “tech stuff.”

The truism goes like this: "A character is what character encodings encode".

That’s not a very exact formulation, but it is a good start. Unicode has the concept of a code point, together with a classification of code points under which some of them are classified as characters (or as denoting characters). The concept of “character” in that sense is essential; it is the most important “character” concept in Unicode. So in good terminology, a single term is assigned to that concept, and the term consistently means just that concept and nothing else.
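To make that classification concrete, here is a minimal Python sketch (assuming Python 3 and its standard unicodedata module; the sample code points are just illustrative). Every value from U+0000 to U+10FFFF is a code point, but only some code points are assigned characters; others are surrogates, private-use code points, noncharacters, or simply unassigned.

import unicodedata

# General_Category gives a rough classification of code points:
# Lu/Ll = letters, Cs = surrogate, Co = private use, Cn = unassigned or noncharacter.
samples = [
    ("U+0041", "\u0041"),    # LATIN CAPITAL LETTER A: an assigned character (Lu)
    ("U+00E9", "\u00e9"),    # LATIN SMALL LETTER E WITH ACUTE (Ll)
    ("U+D800", "\ud800"),    # a surrogate code point, not a character (Cs)
    ("U+E000", "\ue000"),    # a private-use code point (Co)
    ("U+0378", "\u0378"),    # an unassigned code point (Cn)
    ("U+FFFE", "\ufffe"),    # a noncharacter code point (Cn)
]
for label, cp in samples:
    print(label, unicodedata.category(cp))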

The trouble starts with the observation that the concept does not fully correspond to the age-old concept of “character,” which predates Unicode and computers by thousands of years. Such problems are not rare in the modern world. You can solve them either by using a common word as a technical term, as long as you continuously keep it clear whether you are using it as a common word or as a term, or by coining a new word, phrase, or abbreviation.

The Unicode Standard mostly uses “character” as the technical term, but it makes frequent use of “character” as a common word, too, though usually prefixing it with the adjective “abstract” (as if Unicode characters weren’t abstract!).

Historically, character encodings have also encoded, on otherwise equal
footing, units that are intended for device control. Over time, some of
the device control characters have been redefined as indicators of
logical division of text. (TAB and LF are the most prominent examples of
this evolution).

Besides, space “characters” might not be seen as characters in the common-language sense. They are somewhere between “graphic characters” and “control characters.”
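To illustrate that in-between status with the same kind of Python sketch (the categories come from the Unicode Character Database): SPACE and NO-BREAK SPACE sit in a separator category of their own (Zs), while TAB and LINE FEED are in the control category (Cc), and ordinary letters are in the letter categories.

import unicodedata

# Zs = separator (space), Cc = control, Lu = uppercase letter.
for label, ch in [("SPACE", " "),
                  ("NO-BREAK SPACE", "\u00a0"),
                  ("TAB", "\t"),
                  ("LINE FEED", "\n"),
                  ("LATIN CAPITAL LETTER A", "A")]:
    print(label, unicodedata.category(ch))
# Prints Zs, Zs, Cc, Cc, Lu: spaces are classified neither with the
# ordinary graphic characters nor with the Cc controls.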

This is part of the complexity of the correspondence between Unicode characters and text characters (i.e., characters in the old everyday sense):

1) Some Unicode characters are not text characters but, e.g., formatting controls.
2) Some text characters cannot be represented as Unicode characters at all, except as Private Use characters.
3) Some text characters need to be represented as a sequence of two or more Unicode characters (or as Private Use characters).
4) Many text characters have alternative representations as Unicode characters (see the sketch after this list).
5) It is often not self-evident at all how a character used in text could or should be represented using Unicode characters, and many notes in the Unicode Standard are meant to clarify such things.
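To sketch points 3 and 4 in Python (standard library only; the examples are mine, not taken from the Glossary): the text character “é” has both a one-code-point and a two-code-point representation, while “g” with a diaeresis has no one-code-point representation at all.

import unicodedata

precomposed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # U+0065 + U+0301 COMBINING ACUTE ACCENT

# Point 4: two alternative Unicode representations of the same text character;
# normalization maps one to the other.
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Point 3: a text character that can only be a sequence of Unicode characters.
g_diaeresis = "g\u0308"     # U+0067 + U+0308 COMBINING DIAERESIS
print(len(g_diaeresis))     # 2 Unicode characters, one text character
print(unicodedata.normalize("NFC", g_diaeresis))  # unchanged: no precomposed form exists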

These historical developments have left us with this and other examples
of deep ambiguities in the definition of the members of those sets we
call "character encodings".

Ambiguities may exist, but this is basically a matter of distinguishing two concepts from each other.

Let's look at the putative benefit of a better definition. I think such
a benefit has implicitly been claimed to exist, but I would ask for a
demonstration in this case.

For one thing, defining “Unicode character” as a technical term and using it consistently makes it possible to state clearly how it relates to “character” in the common meaning, thereby helping people to understand and use Unicode better.

One possible benefit of a solid definition of the members of a set is in
helping decide which additional entities should be made members of the
set.

That’s a completely different issue. The purpose of definitions and consistent use of terms is not to set guidelines for decisions. It must be possible to say that a particular text character is not a Unicode character without implying (as a kind of naturalistic fallacy) that it should be.

The entire “definition” of the word “character” in the Unicode Glossary is highly confusing, and so is the definition of “abstract character.” They would perhaps best be replaced by the following:

Unicode character. A Unicode code point that is classified as a character code point. It may represent a text character, a component of a text character (such as an accent mark), or a control code for text formatting.

Text character. An element of writing recognized as a basic unit of text, such as a letter, a digit, a punctuation mark, a currency symbol, a syllable symbol in syllabic writing, or an ideograph. This is a non-technical definition, and there are differences in how people mentally divide text into text characters or recognize different graphic symbols as forms of one text character or as separate text characters. A text character is usually representable as a Unicode character or as a sequence of Unicode characters.

Character. A Unicode character or a text character. Normally the context makes it clear which one is meant. In the Unicode Standard, “character” normally means “Unicode character.”

(I’m sure this would need clarifications and tuning. I presented it mainly to illustrate that clarity is possible.)
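As a final illustration of why the distinction matters in practice, a crude Python sketch (it merely skips combining marks, so it only approximates text characters; a proper count would use grapheme clusters as defined in UAX #29): the answer to “how many characters?” depends on which concept you mean.

import unicodedata

text = "cafe\u0301"   # "café" written with a decomposed é

# Count of Unicode characters (code points):
print(len(text))                                                 # 5

# Rough count of text characters: skip combining marks.
print(sum(1 for ch in text if unicodedata.combining(ch) == 0))   # 4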

--
Yucca, http://www.cs.tut.fi/~jkorpela/
