(long) Making orthographies computer-ready (was not Telephoning Tamil)

Addison Phillips [wM] Mon, 29 Jul 2002 14:07:01 -0700

There are always consequences...

... but I am saying that you could build a locale that would work. Generally speaking, 
most programming environments do not look at the Unicode character database for the 
operations in question, or at least, don't look directly at those tables. They use 
custom generated tables or code. For example, from what I know of Java's internal 
structure, it would be relatively easy to construct the necessary classes.

For example, you can create a rule string for RuleBasedCollator that does collation of 
@, since the collator doesn't look at the character properties when performing sorting 
(normalization is another matter, though). A BreakIterator can be fashioned that 
doesn't break on the @ character. Localized strings (as in DateFormat's list of month 
names, for example) are just strings. And so on.
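
As a rough sketch of the collation point (not a tested locale; the four-unit alphabet 
and the class and method names are made up for illustration), a rule string that puts 
"@" between "a" and "b" could look like this. The "@" has to be quoted because it is 
also a syntax character in java.text collation rules:

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

final class AtCollationSketch {
    // Hypothetical alphabet for illustration only: a < @ < b < c.
    // '@' is quoted because it is a rule-syntax character.
    static final String RULES = "< a < '@' < b < c";

    // Returns a sorted copy of the words using the tailored rules.
    static String[] sortWords(String... words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(RULES);
            String[] sorted = words.clone();
            Arrays.sort(sorted, collator);
            return sorted;
        } catch (ParseException e) {
            // The rule string is a constant, so this would be a programming error.
            throw new IllegalStateException("bad collation rules", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortWords("ab", "a@", "aa", "ac")));
    }
}
```

The collator only consults its own rule table here, so "@" sorts as an ordinary 
letter of this made-up alphabet regardless of its Unicode general category.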

The consequences would generally come into play when you encounter code that DOES look 
at Unicode properties (or looks at a table that is not locale-driven). You'll get 
transient failures in that case.
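
To illustrate the kind of code I mean, here is a minimal sketch of a tokenizer that 
decides word boundaries purely from Unicode properties, with no locale involved. The 
class name and the sample word "ta@ka" are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

final class PropertyDrivenTokenizer {
    // Splits text into runs of characters that Character.isLetter accepts.
    // Nothing here consults a locale, so a locale cannot change the result.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isLetter(ch)) {
                current.append(ch);
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // "ta@ka" is a hypothetical word in which "@" is meant to be a letter.
        System.out.println(words("ta@ka nu"));
    }
}
```

Because "@" is not a letter as far as the character database is concerned, this code 
cuts the word in two no matter what your locale data says.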

IOW: the Unicode properties are not just guides. Building "complete Unicode support" 
means taking all the special cases and special pleading into account. Creating a new 
orthography for a minority language should probably take this into account, since what 
one is doing in a small, insular community may be ignored or resisted by Unicode 
implementers, especially if the result cannot be easily fit into existing support 
mechanisms.

The best course of action, if you have the freedom to pursue it, is to choose 
characters that have properties similar to those of the orthographic unit you are 
mapping. "@" has lots of problems: it isn't legal as a "word-part" in a URL, for 
example; it is identified as punctuation (so code that doesn't know about your locale 
may word- or line-break on it); and it has no case mapping (so you're at the mercy of 
SpecialCasing, etc.). It is likely that any special cases that you create for ASCII 
characters will be more of an annoyance for Unicode implementers and thus tend not to 
be supported. Avoiding the creation of special cases is a Good Idea.
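
If you want to check a candidate character before committing to it, something like 
the following sketch would do, assuming Java's Character class as the view onto the 
character database (the class and helper names are mine, for illustration only):

```java
final class CandidateCharacterCheck {
    // True if the character's general category is "Punctuation, other" (Po).
    static boolean isOtherPunctuation(char ch) {
        return Character.getType(ch) == Character.OTHER_PUNCTUATION;
    }

    // True if the simple (locale-independent) case mappings change the character.
    static boolean hasSimpleCaseMapping(char ch) {
        return Character.toUpperCase(ch) != ch || Character.toLowerCase(ch) != ch;
    }

    public static void main(String[] args) {
        for (char ch : new char[] {'@', '!', '\'', 'a'}) {
            System.out.println(ch + " punctuation=" + isOtherPunctuation(ch)
                    + " cased=" + hasSimpleCaseMapping(ch));
        }
    }
}
```

A character that reports as punctuation with no case mapping is a warning sign that 
generic code will not treat it as part of a word.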

There are, of course, several orthographies, some with quite large speaker 
populations, that have this potential issue. One example that occurs to me is the 
Khoisan languages of Africa, which I believe commonly use "!" (U+0021) for a click 
sound. This is almost exactly the same problem you are describing for Tongva.

Nonetheless, if you glance at the "SpecialCasing" file in Unicode, you will note that 
the conditional entries are, almost without exception, locale driven. The first stop in 
creating a new orthography (or computerizing an existing one, perhaps from the days of 
the typewriter) would, for my money, probably be to get ISO 639 to issue the language a 
2-letter code so you can have locale (and Unicode character database) data tagged with 
it ;-).
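
For a concrete picture of locale-driven casing, here is a small sketch using Turkish, 
whose dotted and dotless "i" are handled through language-tagged entries. The class 
and method names are illustrative; the point is only that the language tag changes 
the result:

```java
import java.util.Locale;

final class LocaleCasingSketch {
    // Uppercases text under the casing rules of the given language tag.
    static String upper(String text, String languageTag) {
        return text.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    // Lowercases text under the casing rules of the given language tag.
    static String lower(String text, String languageTag) {
        return text.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        System.out.println(upper("i", "en")); // plain capital I
        System.out.println(upper("i", "tr")); // capital I with dot above (U+0130)
        System.out.println(lower("I", "tr")); // dotless small i (U+0131)
    }
}
```

Without a language code there is nothing to hang such behavior on, which is why the 
code comes first.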

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)  
+1 408.210.3569 (mobile)
-------------------------------------------------
Internationalization is an architecture.
It is not a feature. 

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Curtis Clark
> Sent: Friday, July 26, 2002 11:23 PM
> To: [EMAIL PROTECTED]
> Subject: Re: REALLY *not* Tamil - changing scripts (long)
> 
> 
> Addison Phillips [wM] wrote:
>  > Obviously I'm not an expert in these linguistic areas (and hence
>  > rarely comment on them), but it seems to me that the lack of other
>  > mechanisms makes Unicode an attractive target for criticism in this
>  > area.
> 
> Certainly no Unicode-bashing was intended (I'm more of a Unicode 
> evangelist). I guess I'm confused about the use of Unicode character 
> properties. Are you saying that, even though Unicode defines U+0027 as 
> punctuation, other, I could use it as a glottal stop and create a locale 
> that would treat it as a letter (and still be "Unicode compliant", 
> whatever that is?). And if that's the case, are the Unicode properties 
> just guides? Could I develop an orthography where YÃŸÑØ¨Õ±â‹ would be a 
> word, and there would be no consequences?
> 
> -- 
> Curtis Clark                  http://www.csupomona.edu/~jcclark/
> Mockingbird Font Works                  http://www.mockfont.com/
> 
> 
> 
>
