RE: Concise term for non-ASCII Unicode characters

Tony Jollans Mon, 21 Sep 2015 04:53:54 -0700

As an interested outsider may I suggest that the term "ASCII", indeed the 
concept of ASCII, is only of historical interest and should not be used in any 
modern context. Computing is riddled with terms, "word" being another in 
similar vein, that are used to mean something they are not and would be best 
forgotten.

These days, it is pretty sloppy coding that cares how many bytes an encoding of 
something requires, although there may be many circumstances where legacy 
support is required. You say that, in some contexts, one needs to be really 
clear that the octets 0x80 - 0xFF are Unicode. Either something "is" Unicode, 
or it isn't. Either something uses a recognised encoding, or it doesn't. Using 
these octets to represent Unicode code points is not ASCII, is not UTF-8, and 
is not UCS-2/UTF-16; it could, perhaps, be EBCDIC. Whatever it is, say so 
clearly and explicitly and, if necessary, say why; don't look for some 
mealy-mouthed expression to avoid so saying.

Just my twopenn'orth, and no offence meant, but I can't help thinking you're 
looking for something that shouldn't exist.

Best regards,
Tony Jollans

-----Original Message-----
From: Unicode [mailto:[email protected]] On Behalf Of Sean Leonard
Sent: 21 September 2015 09:22
To: [email protected]
Subject: Re: Concise term for non-ASCII Unicode characters

First of all, thank you all for the responses thus far.

On 9/20/2015 5:51 PM, Martin J. Dürst wrote:
> Hello Sean,
>
> On 2015/09/20 23:48, Sean Leonard wrote:
>> What is the most concise term for characters or code points
>
> So we already have two different things we might need a term for.
>
> [...]
>>
>> The terms "supplementary character" and "supplementary code point" 
>> are defined in the Unicode standard, referring to characters or code 
>> points above U+FFFF. I am looking for something like those, but for 
>> characters or code points above U+007F.
> Anyway, what I wanted to show is that depending on what you need it 
> for, there are so many different variations that it doesn't pay off to 
> create specific short terms for all of them, and the term you use 
> currently may be short enough.

Well, what I am getting at is that when writing standards documents in various 
SDOs (or any other computer science text, for that matter), it is helpful to 
identify these characters/code points.

I think we can limit our inquiry to "characters" and "code points". Both of 
those are well-defined in Unicode (see <http://unicode.org/glossary/>). A 
[Unicode] code point is any value in the range 0 - 0x10FFFF. A [Unicode] 
character is an abstract character that is actually assigned a [Unicode] scalar 
value. Therefore the sets nest, each containing the next: Unicode code points > 
Unicode scalar values > Unicode characters.

"supplementary" means outside the BMP, i.e., 0x10000 - 0x10FFFF.
"BMP" means inside the Basic Multilingual Plane, i.e., 0x0 - 0xFFFF.
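
As a rough illustration of the ranges just described, here is a minimal Python sketch. The function names are hypothetical (they come from neither the Unicode Standard nor any library); the numeric ranges are the ones given above, plus the surrogate range U+D800 - U+DFFF, which is what separates code points from scalar values.

```python
MAX_CODE_POINT = 0x10FFFF  # upper end of the Unicode code space


def is_code_point(n: int) -> bool:
    """Any value in the range 0 - 0x10FFFF."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n: int) -> bool:
    """Surrogate code points, U+D800 - U+DFFF."""
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n: int) -> bool:
    """A code point that is not a surrogate."""
    return is_code_point(n) and not is_surrogate(n)


def is_ascii(n: int) -> bool:
    """U+0000 - U+007F (the Basic Latin block)."""
    return 0 <= n <= 0x7F


def is_beyond_ascii(n: int) -> bool:
    """The range this thread is looking for a name for: above U+007F."""
    return 0x80 <= n <= MAX_CODE_POINT


def is_bmp(n: int) -> bool:
    """Inside the Basic Multilingual Plane, 0x0 - 0xFFFF."""
    return 0 <= n <= 0xFFFF


def is_supplementary(n: int) -> bool:
    """Outside the BMP, 0x10000 - 0x10FFFF."""
    return 0x10000 <= n <= MAX_CODE_POINT


if __name__ == "__main__":
    for cp in (0x41, 0x7F, 0x80, 0xE9, 0xD800, 0xFFFF, 0x10000, 0x10FFFF):
        print(f"U+{cp:04X}",
              "ascii" if is_ascii(cp) else "beyond-ascii",
              "bmp" if is_bmp(cp) else "supplementary",
              "scalar" if is_scalar_value(cp) else "not-scalar")
```

Note that "beyond ASCII" cuts across the BMP / supplementary boundary: U+0080 and U+10000 are both beyond ASCII, but only the second is supplementary.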

The problem is that the BMP / supplementary distinction makes sense in a
UCS-2 / UTF-16 universe. But for much interchange these days, UTF-8 is the way 
to go.

I wish that "non-ASCII characters" and "non-ASCII code points" (and non-ASCII 
scalar values) were sufficient for me. Maybe they can be. However, in contexts 
where ASCII is getting extended or supplemented (e.g., in 
the DNS or in e-mail), one needs to be really clear that the octets 0x80 - 0xFF 
are Unicode (specifically UTF-8, I suppose), and not something else.
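
To make the point about octets concrete, the following is a small Python sketch using only the standard `str.encode` / `bytes.decode` methods; the helper names are made up for illustration. It shows that UTF-8 encodes every code point above U+007F using only octets in 0x80 - 0xFF, and that the same octet can mean something else (or nothing valid) depending on which encoding is assumed.

```python
def utf8_octets(text: str) -> list[int]:
    """Return the UTF-8 encoding of text as a list of octet values."""
    return list(text.encode("utf-8"))


def all_high_octets(ch: str) -> bool:
    """True if every octet of the UTF-8 encoding is in 0x80 - 0xFF."""
    return all(0x80 <= b <= 0xFF for b in ch.encode("utf-8"))


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # ASCII stays a single octet below 0x80.
    print([hex(b) for b in utf8_octets("A")])           # U+0041
    # Code points above U+007F use only octets 0x80 - 0xFF.
    print([hex(b) for b in utf8_octets("\u00e9")])      # U+00E9
    print([hex(b) for b in utf8_octets("\u20ac")])      # U+20AC
    print([hex(b) for b in utf8_octets("\U0001F600")])  # U+1F600
    # The lone octet 0xE9 is U+00E9 in ISO 8859-1 but is not valid UTF-8.
    print(b"\xe9".decode("latin-1") == "\u00e9", is_valid_utf8(b"\xe9"))
```

This is why a bare statement that "the octets 0x80 - 0xFF are used" is not enough: the encoding has to be named for the octets to be interpretable.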

The expressions "beyond [...] ASCII" or "beyond the ASCII range" (as in, 
characters beyond ASCII, code points beyond ASCII) have some support in the 
Unicode Standard; see, e.g., Section 2.5 "ASCII Transparency" 
paragraph. Additionally, as Peter stated, an expression including "Basic Latin 
block" (e.g., characters beyond the Basic Latin block) could work.

FWIW, the term "non-ASCII" is used in e-mail address internationalization 
("EAI") in the IETF (RFCs 6530, 6531, 6532); its opposite is "all-ASCII" (or 
simply "ASCII"). The term also appears in RFC 2047 from November 1996, but 
there it has a more expansive meaning (i.e., not limited or targeted to 
Unicode).

Sean
