RE: Concise term for non-ASCII Unicode characters

Peter Constable Mon, 21 Sep 2015 17:20:06 -0700

From: Unicode [mailto:[email protected]] On Behalf Of Sean Leonard
Sent: Monday, September 21, 2015 1:22 AM


> Well what I am getting at is that when writing standards documents in various 
> SDOs (or any other
> computer science text, for that matter), it is helpful to identify these 
> characters/code points.

[snip]

> However, in contexts where ASCII is getting extended or supplemented (e.g., 
> in the DNS or in e-mail), 
> one needs to be really clear that the octets 0x80 - 0xFF are Unicode 
> (specifically UTF-8, I suppose), 
> and not something else.

Well, if you are writing standards that "extend ASCII", then you need to be 
completely clear that what is being discussed is _not ASCII_. In that sense, I 
agree with Tony Jollans' comments: be clear about what it is that is being 
discussed — including what coded character set, or what encoding form for what 
coded character set.


> FWIW, the term "non-ASCII" is used in e-mail address internationalization 
> ("EAI") in the IETF; its 
> opposite is "all-ASCII" (or simply "ASCII"). (RFCs 6530, 6531, 6532). The 
> term also appears in RFC 
> 2047 from November 1996 but there it has the more expansive meaning (i.e., 
> not limited or 
> targeted to Unicode).

Glancing at the Introduction for RFC 6530, it seems to have clear terminology:

" Without the extensions specified in this document, the mailbox name is 
restricted to a subset of 7-bit ASCII [RFC5321].  Though MIME [RFC2045] enables 
the transport of non-ASCII data..."

Here, "ASCII" means ASCII — the 7-bit encoding originally defined as ANSI X3.4. 
And "non-ASCII data" appears to mean data involving any characters other than 
those in the ASCII coded character set, or any data in an encoded 
representation other than ASCII. The term "all-ASCII" is used in section 4.2, 
but it is immediately defined: 

"In this document, an address is "all-ASCII", or just an "ASCII address", if 
every character in the address is in the ASCII character repertoire [ASCII]; an 
address is "non-ASCII", or an "i18n-address", if any character is not in the 
ASCII character repertoire."

So, it seems like they had a similar terminology need to what you describe, and 
they handled it in a satisfactory, clear way.
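
To make the quoted definition concrete, here is a minimal Python sketch of the 
all-ASCII / non-ASCII distinction. The function names are hypothetical (they do 
not come from RFC 6530), and the sketch assumes that "the ASCII character 
repertoire" can be tested as code points U+0000 through U+007F.

```python
def is_all_ascii(address: str) -> bool:
    """Return True if every character is in the ASCII repertoire.

    Assumption: the repertoire is taken to be code points U+0000..U+007F.
    """
    return all(ord(ch) <= 0x7F for ch in address)


def classify_address(address: str) -> str:
    """Label an address with the two terms defined in the quoted passage."""
    return "all-ASCII" if is_all_ascii(address) else "non-ASCII"
```

Under that assumption, classify_address("user@example.com") gives "all-ASCII", 
while an address containing even one character such as "ü" is "non-ASCII".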


If what you need to describe is UTF-8 sequences of two or more bytes, then I 
would be clear that the context is Unicode UTF-8, not ASCII or any other coded 
character set / encoding form; and I would say, "Unicode UTF-8 code unit 
sequences of two to four bytes" or "Unicode UTF-8 multi-byte sequences" or 
something along those lines.
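
For illustration, a small Python sketch of what "UTF-8 multi-byte sequences" 
picks out: the characters whose UTF-8 encoded form is two to four bytes long, 
i.e. everything above U+007F. The helper names are hypothetical, and the sketch 
assumes well-formed input (a lone surrogate in the string would make the encode 
call raise an error).

```python
from typing import List, Tuple


def utf8_length(ch: str) -> int:
    """Number of UTF-8 code units (bytes) used to encode one character."""
    return len(ch.encode("utf-8"))


def multibyte_sequences(text: str) -> List[Tuple[str, bytes]]:
    """Pair each non-ASCII character in text with its UTF-8 byte sequence.

    Characters at or below U+007F encode as a single byte and are skipped.
    """
    return [(ch, ch.encode("utf-8")) for ch in text if ord(ch) > 0x7F]
```

For example, "a" encodes as one byte, "é" (U+00E9) as two, "€" (U+20AC) as 
three, and U+1F600 as four; only the last three would be reported by 
multibyte_sequences.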

If you think it's a serious problem that there isn't one conventional term for 
"characters outside the ASCII repertoire" or "UTF-8 multi-code-unit encoded 
representations" (since different authors could devise different terminology 
solutions), then I suggest you submit a document to UTC explaining why it's a 
problem, documenting inconsistent or unclear terminology that's been used in 
some standards / public specifications, and requesting that Unicode formally 
define terminology for these concepts. I can't guarantee that UTC will do it, 
but I can predict with confidence that it _won't_ do anything of that nature if 
nobody submits such a document.



Peter
