Re: About Encoding Theory (was: Re: Again not about Phoenician)

Peter Kirk Tue, 09 Nov 2004 11:08:40 -0800

On 09/11/2004 02:30, Kenneth Whistler wrote:

Peter Kirk suggested:
I am suggesting that the best way to get the job done properly is to lay the conceptual foundation properly first, instead of trying to build a structure on a foundation which doesn't match...

Part of the problem that I think some people are having here, including Peter, is that they are ascribing the wrong level to the Unicode Standard itself.

Maybe. But why is this? Is it because the Standard describes itself misleadingly? Is it because it has been oversold? Is it because people who are looking for a conceptual framework look to the text of the Standard, and think they have found one there when in fact what they find is something different?

For example, a professor described on this list as one of the most famous in his field wrote that each of the proposers and supporters of a script proposal "either does not understand Unicode or (and probably "and") does not understand what a glyph is" (quoted on this list in May this year). Implicitly his criticism applies even to the majority of UTC members who accepted the proposal. Was he being unreasonable? What was his basis for claiming to understand Unicode better than the UTC members? I can't speak for the professor, but I would suppose that his claim to understand Unicode is based to a large extent on his reading of the Standard, and explanations from others who have read it. If this professor, a leading expert in his field, is finding such inconsistencies, and as a result of them is slandering the UTC and rejecting Unicode, doesn't this suggest that there is something wrong?

...
The Unicode Standard is *NOT* a standard for the theory or process of character encoding. It does not spell out the rules whereby character encoding committees are constrained in their process, nor does it lay down specifications that would allow anyone to follow some recipe in determining what "thing" is a separate script and what is not, nor what "entity" is an appropriate candidate for encoding as a character and what is not.

It does not normatively specify such things, agreed. But it does appear to describe them, at least in outline, in its informative section entitled "Unicode Design Principles". And these outline descriptions are misleading. All I am asking is that the misleading text be adjusted so that it is not misleading and is consistent with the actual practice of the UTC. I have proposed one way to do so. You may prefer another way, perhaps something like replacing "Characters are the abstract representations of the smallest components of written language that have semantic value." on p.15 by "... the smallest components of written language which have been determined by the character encoding committees to be usefully distinguishable." That may be too obviously ad hoc, but at least it stops people trying to interpret "semantic value" as something of theoretical significance.

... Even *cataloging* the world's writing systems is immensely controversial -- let alone trying to hammer some significant set of "historical nodes" into a set of standardized encoded characters that can assist in digital representation of plain text content of the world's accumulated and prospective written heritage.

Indeed. But if such a standardised set is to be generally acceptable, the controversies have to be resolved, and they should be resolved by open discussion and diplomatic decision-making, not by imposition of one view and accusations that those who hold other views are not "reasonable".

Contrary to what Peter is suggesting, I think it is putting the cart before the horse to expect a standard theory of script encoding to precede the work to actually encode characters for the scripts of the world.

Well, a standard theory is more than what I was asking for. I was looking for an accurate summary description of the criteria currently being used; or failing that, at least deletion of the current inaccurate description.

The Unicode Standard will turn out the way it does, with all its limitations, warts, and blemishes, because of a decades-long historical process of decisions made by hundreds of people, often interacting under intense pressure. Future generations of scholars will study it and point out its errors.

Future generations of programmers will continue to use it as a basis for information processing, and will continue to program around its limitations.

I agree, of course, that Unicode will not be perfect. But that is no argument against doing the best job we can now. Future scholars will have fewer errors to point out if present-day scholars who point out supposed errors in proposals are listened to, and not told things like "I can't say that I care a fig". And future programmers will have fewer limitations to program around, at great expense, if more care is taken to avoid defining and stabilising such limitations. Anyway, what is the great hurry? There may be one with certain modern scripts, but I don't see much urgency with historic scripts. Just listening more and taking more care will help to put off the inevitable *THEN* when Unicode has to be replaced.

And I expect that *THEN* a better, comprehensive theory of script and symbol encoding for information processing will be developed. And some future generation of information technologists will rework the Unicode encoding into a new standard of some sort, compatible with then-existing "legacy" Unicode practice, but avoiding most of the garbage, errors, and 8-bit compatibility practice that we currently have to live with, for hundreds of accumulated (and accumulating) reasons.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
