Since Jukka seemed to take issue with my responding to his proffered definitions by instead bringing up an analogy between "life" and "character", I'll try responding
directly to the attempted clarifications.

On 7/13/2011 12:45 AM, Jukka K. Korpela wrote:
That’s a completely different issue. The purpose of definitions and consistent use of terms is not to set guidelines for decisions. It must be possible to say that a particular text character is not a Unicode character without implying (as a naturalistic fallacy of a kind) that it should be.

UTC members say stuff like that all the time, without confusion or ambiguity:

"The characters of the Tangut script are not yet encoded in the Unicode Standard."


The entire “definition” of the word “character” in the Unicode Glossary is highly confusing, and so is “abstract character.”

"Abstract character" is deliberately aligned with the longstanding SC2 normative definition
of "character". See 10646:

"character: member of a set of elements used for the organization, control, or
representation of textual data"

That goes way back in the history of SC2. Back to the 8859 series before 10646, and then back to ISO 2022 before that. There is little point in revising that, as it would only introduce
a disconnect (and the potential for more confusion).

"character (2)" in the glossary is simply a synonym for "abstract character". It is what people talk about when they are talking character encoding theory, as opposed to
what particular entities are encoded in a particular character encoding.

"character (1)" in the glossary is what you are defining below as "text character" -- it is an element of writing, considered independently of any considerations of character encoding.

"character (3)" in the glossary is what you are defining below as "Unicode character". Since not all "abstract characters" are actually encoded in the Unicode standard, nor are all "text characters", we need some concept of "characters that are encoded in the Unicode Standard". And when the context of Unicode is already implied, that is almost always
what "character" means, in the documentation or the discussion.

They would perhaps best be replaced by the following:

Now, as to your particular suggestions:


Unicode character. A Unicode code point classified to be a character code point. It may represent a text character, a component of a text character (such as an accent symbol), or a control code for text formatting.

"a character code point" is an undefined term here. We can talk about assigning a code point to a character (1). If we do so, then that character becomes an "encoded character" (q.v. in the glossary). If that assignment occurs in the Unicode Standard, then it becomes a "Unicode encoded character". "Unicode character" is our general shorthand for "Unicode encoded character", and we often shorten it just to "character (3)", because most
of the time it is assumed we are talking about Unicode encoded characters.

"component of a text character" is another undefined term here. It begs questions of graphology: why this "component", and not that "component", and what is a "component"
anyway?

"It may represent", rather than clarifying, actually muddies the definitional context here.

Definitionally, a "Unicode encoded character" is an association between a particular (Unicode) code point and a particular abstract character. What that abstract character
itself then represents is beside the point.
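
To make that association concrete, here is a minimal sketch, assuming Python's standard unicodedata module (whose data reflects whichever Unicode version the interpreter ships with). The helper name describe is hypothetical. It only shows that a code point either carries an assigned character name or does not; it says nothing about what the abstract character represents.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME' when a character name is assigned to the code point.

    Hypothetical helper for illustration only. Code points without a name
    (noncharacters, unassigned code points, control codes) are reported
    with their General_Category instead.
    """
    ch = chr(code_point)
    name = unicodedata.name(ch, None)  # None when no name is assigned
    if name is None:
        category = unicodedata.category(ch)
        return f"U+{code_point:04X} <no character name; category {category}>"
    return f"U+{code_point:04X} {name}"


print(describe(0x0041))  # U+0041 LATIN CAPITAL LETTER A
print(describe(0x0301))  # U+0301 COMBINING ACUTE ACCENT
print(describe(0xFFFF))  # a noncharacter: no name, category Cn
```

Note that the lookup goes from code point to name; nothing in it depends on whether the entity is a letter, an accent, or anything else.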


Text character. An element of writing recognized as a basic unit of text, such as a letter, digit, punctuation mark, currency symbol, a syllable symbol in syllabic writing, or an ideograph. This is a non-technical definition, and there are differences in how people mentally divide text into text characters or recognize different graphic symbols as forms of a text character or as separate text characters. A text character is usually representable as a Unicode character or as a sequence of Unicode characters.

This definition has problems because it introduces a new term "text character" that ordinary people don't actually use, for what ostensibly is the ordinary, non-technical usage of the term "character". It is also itself potentially ambiguous between
the intended (but awkward) sense of "text[ual] {attributive} character" and
"character [in or of the] text".

A preferable approach, in my opinion, is to default to the writing-system-specific terms for units, when talking about these things: letters, syllables, sinograms, aksaras, ligatures, etc., or the pieces: accent marks, strokes, radicals, components, jamos, etc. If one wants a technical cover term for such things, grapheme comes to mind, but if trying to explain things to the general public, "things that people think of as characters" is
the workaround we usually apply.
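
As a small illustration of the gap between "things that people think of as characters" and encoded characters, the following sketch assumes Python's standard unicodedata module. The helper name code_points is hypothetical. It shows one user-perceived letter, "é", represented either as a single encoded character or as a sequence of two.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """List the code points in a string in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


precomposed = "\u00e9"                                   # one encoded character
decomposed = unicodedata.normalize("NFD", precomposed)   # base letter + combining accent

print(code_points(precomposed))  # ['U+00E9']
print(code_points(decomposed))   # ['U+0065', 'U+0301']

# Both strings are canonically equivalent: normalizing back to NFC
# recovers the single precomposed character.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

A reader sees the same letter in both cases; only the encoded representation differs.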


Character. A Unicode character or a text character. Normally the context makes it clear which one is meant. In the Unicode Standard, “character” normally means “Unicode character.”

Actually, I think this would contribute to the naturalistic fallacy you cited above.

One of the biggest problems that the character encoding committees face is
the assumption by those new to the encoding process that once a
"character" has been identified by a proposal ("X is a character in my
writing system"), that inexorably implies that it should be encoded as
a "character" in Unicode. Of course, the identification of a "character"
in that sense (what the user or community thinks of as a character) is only
the first step in the analysis as to whether the entity in question is an
appropriate abstract character, and then further, as to whether that abstract
character, once clearly identified, actually should be encoded (as a single
"Unicode encoded character").


(I’m sure this would need clarifications and tuning. I presented it mainly to illustrate that clarity is possible.)

And what I've indicated are some of the reasons why I think fiddling further with the definition(s) of "character" is likely to lead to further problems, rather than self-evidently
improve the situation.

--Ken


