Re: PRI #200: Draft UTR #49, Unicode Character Categories

Andrew West Thu, 14 Jul 2011 02:57:27 -0700

On 14 July 2011 00:03,  <[email protected]> wrote:
> The Unicode Technical Committee has posted a new issue for public review and
> comment. Details are on the following web page:
>
> PRI #200    Draft UTR #49: Unicode Character Categories
>
> This document presents an approach to the categorization of Unicode
> characters, and documents data files that implementers can use for defining
> and labeling Unicode character categories.


==General Rant==

I like the idea of categorizing characters hierarchically, but any
categorization scheme is necessarily subjective to a greater or lesser
degree, and I do not think that the Unicode Consortium should be
pushing one particular hierarchical categorization model as the
definitive categorization of Unicode characters.  It seems to me that
this is one of several recent expansions to the scope of the Unicode
Character Database (ScriptExtensions.txt is another example) that are
neither necessary nor particularly helpful.

==Specific Comment==

There are 18 top-level categories:
[Control]
[Diacritic]
[Format]
[Hieroglyph]
[Ideogram]
[Ideograph]
[Letter]
[Logogram]
[Logograph]
[Mark]
[Number]
[Punctuation]
[Sign]
[Syllable]
[Symbol]
[Virama]
[Vowel]
[Word]

What are the differences between [Ideograph] and [Ideogram], and
between [Logograph] and [Logogram]?  Even if UTR #49 does give
distinctly different definitions for each of these four top-level
categories, the distinctions will not be obvious to most users of
Categories.txt, because the -graph/-gram forms are synonymous in
general use:

<http://en.wikipedia.org/wiki/Logogram>
<http://en.wikipedia.org/wiki/Ideogram>
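
For anyone who wants to check the overlap mechanically, here is a
rough Python sketch.  It assumes only the 18 labels as listed in this
message (it does not read Categories.txt, whose format is not shown
here), and the function name gram_graph_pairs is made up for
illustration.  It simply reports labels that differ only in the
-gram/-graph ending:

```python
# Top-level category labels copied from the list in this message
# (an assumption: not parsed from Categories.txt itself).
TOP_LEVEL = [
    "Control", "Diacritic", "Format", "Hieroglyph", "Ideogram", "Ideograph",
    "Letter", "Logogram", "Logograph", "Mark", "Number", "Punctuation",
    "Sign", "Syllable", "Symbol", "Virama", "Vowel", "Word",
]


def gram_graph_pairs(labels):
    """Return (X-gram, X-graph) pairs that share the same stem."""
    names = set(labels)
    pairs = []
    for name in sorted(names):
        if name.endswith("gram"):
            # Swap the "-gram" ending for "-graph" and look for a twin.
            twin = name[: -len("gram")] + "graph"
            if twin in names:
                pairs.append((name, twin))
    return pairs


if __name__ == "__main__":
    print(len(TOP_LEVEL), "top-level categories")
    for gram, graph in gram_graph_pairs(TOP_LEVEL):
        print(f"[{gram}] vs [{graph}]")
```

On the list above this flags exactly the two pairs in question, which
is the point: nothing in the label names themselves tells a user how
the members of each pair differ.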

Andrew
