Peter Kirk suggested:

> I am suggesting that the best way to get the job done properly is to lay
> the conceptual foundation properly first, instead of trying to build a
> structure on a foundation which doesn't match...
Part of the problem that I think some people are having here, including Peter, is that they are ascribing the wrong level to the Unicode Standard itself. The Unicode Standard is a character encoding standard. What it standardizes are the numerical codes for representing abstract characters (plus quite a number of related things having to do with character properties and algorithms for manipulating characters in text to do various things; see the brief sketch in the postscript below).

The Unicode Standard is *NOT* a standard for the theory or process of character encoding. It does not spell out the rules whereby character encoding committees are constrained in their process, nor does it lay down specifications that would allow anyone to follow some recipe in determining what "thing" is a separate script and what is not, nor what "entity" is an appropriate candidate for encoding as a character and what is not.

Ultimately, *those* kinds of determinations are made by the character encoding committees, based on argumentation made in proposals, by proponents and opponents, and in the context of legacy practice, potential cost/benefit tradeoffs for existing and prospective implementations, commitments made to stability, and so on. They do not consist of the encoding committees -- either one of them -- turning to the Unicode Standard, page whatever, or ISO/IEC 10646, page whatever, to find the rule which determines what the answer is. In fact the answers evolve over time, because the demands on the standard evolve, the implementations evolve, and the impact of the dead hand of legacy itself changes over time.

It is all well and good for people to point out the dynamic nature of scripts themselves -- their historic connections and change over time, which often make it notably difficult to determine whether to encode particular instantiations at particular times in history as a "script" in the character encoding standard. But I would suggest that people bring an equivalently refined historical analysis to the process of character encoding itself. We are dealing with a *very* complex set of conflicting requirements here for the UCS, and attempting a level of coverage over the entire history of writing systems in the world. Even *cataloging* the world's writing systems is immensely controversial -- let alone trying to hammer some significant set of "historical nodes" into a set of standardized encoded characters that can assist in digital representation of plain text content of the world's accumulated and prospective written heritage.

Contrary to what Peter is suggesting, I think it is putting the cart before the horse to expect a standard theory of script encoding to precede the work to actually encode characters for the scripts of the world. The Unicode Standard will turn out the way it does, with all its limitations, warts, and blemishes, because of a decades-long historical process of decisions made by hundreds of people, often interacting under intense pressure. Future generations of scholars will study it and point out its errors. Future generations of programmers will continue to use it as a basis for information processing, and will continue to program around its limitations. And I expect that *THEN* a better, comprehensive theory of script and symbol encoding for information processing will be developed.
And some future generation of information technologists will rework the Unicode encoding into a new standard of some sort, compatible with then-existing "legacy" Unicode practice, but avoiding most of the garbage, errors, and 8-bit compatibility practice that we currently have to live with, for hundreds of accumulated (and accumulating) reasons.

--Ken
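P.S. For concreteness about what the standard itself *does* pin down, here is a minimal sketch using Python's standard unicodedata module, which is generated from the Unicode Character Database. The three characters are arbitrary examples; the point is simply that what gets standardized is a code point plus a bundle of properties attached to an abstract character, not a theory of scripts:

    import unicodedata

    # Three arbitrary examples: LATIN CAPITAL LETTER A, LATIN SMALL LETTER C
    # WITH CEDILLA, and DEVANAGARI LETTER A.  For each one, print what the
    # standard actually assigns: a code point, a character name, a general
    # category, and a bidirectional class -- all drawn from the Unicode
    # Character Database.
    for ch in "A\u00E7\u0905":
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch):35}"
              f"category={unicodedata.category(ch)}  "
              f"bidi={unicodedata.bidirectional(ch)}")

Running that prints, for example, "U+0905  DEVANAGARI LETTER A ... category=Lo  bidi=L". Nothing in those standardized data tells you *why* Devanagari is a separate script or why that letter merited a code point; those determinations were made by the committees, in the ways described above.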

