Jukka,

reminding everyone of the definition of "technical term" as opposed to a word in everyday language isn't helping address the underlying issue. Everyone is familiar with this distinction.

You note that there's a bit of a truism that underlies the definition of character and character encoding, but I would claim this is not limited to Unicode, and has nothing to do with promoting that standard. The truism goes like this: "A character is what character encodings encode".

As such, "character" also becomes the smallest unit on which algorithms for processing textual data operate.

Historically, character encodings have also encoded, on otherwise equal footing, units that are intended for device control. Over time, some of the device control characters have been redefined as indicators of logical division of text. (TAB and LF are the most prominent examples of this evolution.)

These historical developments have left us with this and other examples of deep ambiguities in the definition of the members of those sets we call "character encodings". These ambiguities are reflected in the technical (as opposed to everyday) usage of the term "character". I fully agree with Ken that you can't "fix" this situation by definitional fiat.

Let's look at the putative benefit of a better definition. I think such a benefit has implicitly been claimed to exist, but I would ask for a demonstration in this case.

One possible benefit of a solid definition of the members of a set is in helping decide which additional entities should be made members of the set. Can there be a definition of "character" that provides a solid guidepost for evaluating future proposed character additions to the standard?

Over twenty years of work on the Unicode Standard (and decades of work on earlier standards) have clearly demonstrated that it is impossible to devise an "algorithm" for deciding which candidates are worthy of being encoded in Unicode (or any other character encoding).

The problem goes back to the incredible diversity of writing systems and notations and their use. It is further complicated by the fact that breaking down a writing system into elements (identifying the characters) can quite often be done in more than one way. In many instances it's not even obvious which method is the "best" in a given circumstance. Attempts to base this process on mechanistic rules (driven by definitions) are bound to fail.

Hence, "characters" are the outcome of a creative (human) process of analyzing writing systems. Once you have made a particular analysis, usually ending in an encoding, the elements thus defined are "de facto" the "characters".

If you were to accept that it is impossible to rigorously define characters for purposes of making this analysis, the problem becomes simpler. "Abstract" characters are then entities encoded in one (or more) character encodings, and "character" is what character encodings encode. Operationally, characters are "the smallest units operated on by algorithms that process textual data".

"Operated on" would sidestep the distinctions between characters that represent elements of a writing system like "A" and what Unicode calls format controls like "RLM" (or the segmentation characters like "PS", "LF", "TAB").

A bit is not the smallest unit, because the algorithms (as logically described) don't operate on bits, they are defined in terms of characters (or sequences of characters).
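
To make the last two paragraphs concrete, here is a minimal sketch in Python (my choice for illustration; nothing in the argument depends on the language). The sample stream and the helper name `units` are made up for this example. It walks a short text character by character and reports each unit's Unicode general category, then compares the number of units with the byte lengths of two encoding forms.

```python
import unicodedata

# Hypothetical sample stream: a letter ("A"), a format control (RLM),
# two segmentation characters (TAB, LF), and a paragraph separator (PS).
STREAM = "A\u200f\t\n\u2029"


def units(text):
    """Return (code point, general category) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


char_units = units(STREAM)

# The same text measured in bytes, under two different encoding forms.
utf8_len = len(STREAM.encode("utf-8"))
utf16_len = len(STREAM.encode("utf-16-be"))

for code_point, category in char_units:
    print(code_point, category)
print(len(char_units), "characters;",
      utf8_len, "bytes in UTF-8;",
      utf16_len, "bytes in UTF-16")
```

The loop sees five units whether the stream is stored as 9 bytes of UTF-8 or 10 bytes of UTF-16, and it treats the letter (Lu), the format control (Cf), the control characters (Cc) and the paragraph separator (Zp) as the same kind of thing: a unit to be operated on. That is all the operational definition is meant to capture; it says nothing about what each unit represents.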

For a fuller definition you might need to make clear that "display" is covered by "process", and you might also need a way to cover the traditional use of control characters. They could be described as the smallest units operated on by algorithms that control devices displaying text, based on data embedded in a text stream.

While there might be some improvement in rewording the glossary entries in this way, doing so neither removes the inherent tautology nor eliminates the fact that characters are very diverse in what they represent.

But it might make clear that no definition of "character" will ever be sufficient to serve as input to the process of deciding the question of whether a proposed new entity is or isn't a character.

A./
