Re: Concise term for non-ASCII Unicode characters

Sean Leonard Mon, 21 Sep 2015 01:30:08 -0700

First of all, thank you all for the responses thus far.


On 9/20/2015 5:51 PM, Martin J. Dürst wrote:

Hello Sean,

On 2015/09/20 23:48, Sean Leonard wrote:
What is the most concise term for characters or code points
So we already have two different things we might need a term for.

[...]
The terms "supplementary character" and "supplementary code point" are
defined in the Unicode standard, referring to characters or code points
above U+FFFF. I am looking for something like those, but for characters
or code points above U+007F.
Anyway, what I wanted to show is that depending on what you need it for, there are so many different variations that it doesn't pay off to create specific short terms for all of them, and the term you use currently may be short enough.

Well, what I am getting at is that when writing standards documents in various SDOs (or any other computer science text, for that matter), it is helpful to identify these characters/code points.

I think we can limit our inquiry to "characters" and "code points". Both of those are well-defined in Unicode (see <http://unicode.org/glossary/>). A [Unicode] code point is any value in the range 0 - 0x10FFFF. A [Unicode] character is an abstract character that is actually assigned a [Unicode] scalar value. Therefore the space is Unicode code point > Unicode scalar value > Unicode character.
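
To make the nesting concrete, here is a minimal Python sketch. The function name classify is hypothetical, and "assigned" is approximated as "general category is not Cn" according to whatever Unicode version the local unicodedata module ships, so results for recently added characters may vary by Python release.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF

def classify(cp):
    """Return the narrowest of the three nested sets that cp belongs to."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point: %#x" % cp)
    # Surrogates are code points but not scalar values.
    if 0xD800 <= cp <= 0xDFFF:
        return "code point"
    # Assumption: category Cn (reserved code points and noncharacters)
    # stands in for "not assigned to an abstract character".
    if unicodedata.category(chr(cp)) == "Cn":
        return "scalar value"
    return "character"

for cp in (0x41, 0xE9, 0xD800, 0xFFFF):
    print("U+%04X" % cp, classify(cp))
```

Here U+0041 and U+00E9 come out as characters, U+D800 only as a code point, and the noncharacter U+FFFF as a scalar value that is not a character.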


"supplementary" means outside the BMP, i.e., 0x10000 - 0x10FFFF.
"BMP" means inside the Basic Multilingual Plane, i.e., 0x0 - 0xFFFF.

The problem is that the BMP / supplementary distinction makes sense in a UCS-2 / UTF-16 universe. But for much interchange these days, UTF-8 is the way to go.
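
As a rough illustration of why the boundary that matters depends on the encoding form, the sketch below counts code units per character. The helper name code_units is made up for this example; it relies only on Python's built-in codecs.

```python
def code_units(ch):
    """Return (UTF-8 bytes, UTF-16 code units) needed for one character."""
    utf8_len = len(ch.encode("utf-8"))
    # utf-16-le emits no byte order mark; each code unit is 2 bytes.
    utf16_len = len(ch.encode("utf-16-le")) // 2
    return utf8_len, utf16_len

samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print("U+%04X" % ord(ch), code_units(ch))
```

In UTF-16 the count changes only at the BMP / supplementary boundary (one code unit up to U+FFFF, two above it). In UTF-8 the first change happens at U+0080, which is exactly the ASCII / non-ASCII boundary.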

I wish that "non-ASCII characters" and "non-ASCII code points" (and non-ASCII scalar values) were sufficient for me. Maybe they can be. However, in contexts where ASCII is getting extended or supplemented (e.g., in the DNS or in e-mail), one needs to be really clear that the octets 0x80 - 0xFF are Unicode (specifically UTF-8, I suppose), and not something else.
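
The ambiguity of octets 0x80 - 0xFF can be sketched in Python as follows. The function name describe_octets and its three labels are hypothetical. Note the limits of the check: decoding successfully as UTF-8 only shows that the octets are well-formed UTF-8, not that the sender intended UTF-8.

```python
def describe_octets(data):
    """Classify a byte string as all-ASCII, well-formed UTF-8, or neither."""
    if all(b <= 0x7F for b in data):
        return "all-ASCII"
    try:
        # Strict decoding rejects stray lead bytes, bad continuation
        # bytes, overlong forms, and encoded surrogates.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "non-ASCII, not UTF-8"
    return "non-ASCII, well-formed UTF-8"

name = "D\u00fcrst"
print(describe_octets("Sean".encode("ascii")))
print(describe_octets(name.encode("utf-8")))
print(describe_octets(name.encode("latin-1")))
```

The same name yields different octets above 0x7F depending on the charset: the UTF-8 form passes the check, while the ISO 8859-1 form (a lone 0xFC octet) does not.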

The expressions "beyond [...] ASCII" or "beyond the ASCII range" (as in, characters beyond ASCII, code points beyond ASCII) have some support in the Unicode Standard; see, e.g., the "ASCII Transparency" paragraph in Section 2.5. Additionally, as Peter stated, an expression including "Basic Latin block" (e.g., characters beyond the Basic Latin block) could work.

FWIW, the term "non-ASCII" is used in e-mail address internationalization ("EAI") in the IETF; its opposite is "all-ASCII" (or simply "ASCII"). See RFCs 6530, 6531, and 6532. The term also appears in RFC 2047 from November 1996, but there it has a more expansive meaning (i.e., not limited or targeted to Unicode).


Sean
