On 9/29/2015 11:50 AM, Ken Whistler wrote:
At any rate, any formal contribution that suggests coming up with terminology for
the #1 and #2 sets should take these considerations under advisement.

The original premise of this thread was (and is) to find the *most concise* term for that range U+0080 - U+10FFFF, regardless of whether that range is for characters, code points, scalar values, or coffee cup icons ☕️. Preferably, such a concise term would have support in the Unicode Standard, or in some other standard. I was not looking for a totally new, invented term, but rather a term that has empirical, standards-based support.

A full survey of the Unicode Standard 8.0 finds that the term "beyond ASCII" has textual support:

p. 1 Introduction: While taking the ASCII character set as its starting point, the Unicode Standard goes far beyond ASCII’s limited ability [...]

p. 37 ASCII Transparency: [UTF-8] maintains transparency for all of the
ASCII code points (0x00..0x7F). That means Unicode code points U+0000..U+007F are
[thus] indistinguishable from ASCII itself. [...] Beyond the ASCII
range of Unicode, many [...] scripts are represented by two bytes [in UTF-8...]

p. 200 Programming Languages: A limitation of the ISO/ANSI C model is its assumption that characters can always be processed in isolation. Implementations that choose to go beyond the ISO/ANSI C model may
find it useful to mix widths within their APIs.
{This formulation is not "beyond ASCII", but uses the preposition "beyond" in the exact same sense, since ASCII is fixed-width and forms an underlying assumption of the ISO/ANSI C model.}

p. 237 Case Mappings: A number of complications to case mappings occur once the repertoire of characters is
expanded beyond ASCII.

p. 677 Han / CJK Unified Ideographs Extension B: The ideographs in the CJK Unified Ideographs Extension B block represent an additional set of 42,711 unified ideographs beyond the 27,496 included in The Unicode Standard, Version 3.0.
{This formulation uses the preposition "beyond" in the exact same sense, namely, a subsequent range that is beyond the original range.}
Ditto for Extension C, Extension D, and Extension E.

Finally, (case) "beyond ASCII" is in the Index at p. 237.
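
The ASCII transparency described in the p. 37 passage can be checked mechanically. The following is only a minimal Python sketch, not anything taken from the Standard; the helper name utf8_length is made up for illustration. It confirms that U+0000..U+007F encode in UTF-8 to the same single byte as in ASCII, and that the byte count grows once a code point lies beyond ASCII.

```python
def utf8_length(cp: int) -> int:
    """Number of bytes UTF-8 uses for one code point (hypothetical helper)."""
    return len(chr(cp).encode("utf-8"))


# Every ASCII code point encodes to the identical single byte in UTF-8 and ASCII.
ascii_transparent = all(
    chr(cp).encode("utf-8") == chr(cp).encode("ascii") == bytes([cp])
    for cp in range(0x80)
)
print("ASCII transparent:", ascii_transparent)

# Sample code points on either side of each UTF-8 length boundary.
# Surrogates (U+D800..U+DFFF) are deliberately avoided: they are not encodable.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

The first code point that needs more than one byte is U+0080, which is exactly where the range under discussion begins.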


Perhaps this thread would have gone differently if the original subject was "Beyond ASCII" instead of...that other one. 😉

Now, I am not saying that the term *must* be "beyond ASCII". However, the term "non-ASCII" (with or without "Unicode") has no support in the Unicode Standard 8.0. The only occurrence is the reference to RFC 2047, and in that document, "non-ASCII" clearly means any and every character encoding ever invented, not specifically Unicode.


Another thing is the oxymoron "ASCII Unicode" (the opposite of "non-ASCII Unicode"). Actually ASCII is a formal subset of Unicode...at the beginning. ASCII itself (ANSI X3.4-1986) is a 7-bit character set; it does not limit itself to any particular word length so long as the 7 bits are in those combinations. Therefore U+0000 - U+007F characters encoded in UTF-32 or UTF-16 are in ASCII codes; they are truly ASCII characters. When a bit combination '?' (0x3F) is loaded into a 64-bit register on a CPU, is it still an ASCII character? My view is yes.

They are not in ASCII *encoding*, as *encoding* is limited to a sequence of 7-bit or 8-bit combinations (X3.4-1986 Section 2.1.1(1)). My point here is that to be correct, one ought to use some sort of preposition, namely "ASCII in Unicode" or "ASCII [characters/code points/scalar values] in Unicode"--but if you slice off "in Unicode", you are left with "ASCII" and that is just fine. This is another basis for the proposition that "beyond ASCII" (e.g., "characters beyond ASCII [in Unicode]", "beyond the ASCII range [of Unicode]") makes sense.
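
To illustrate the point about word length, here is a rough Python sketch (the helper names is_beyond_ascii and code_units are invented for this example, and big-endian encodings are chosen only to avoid a byte order mark). It shows that '?' keeps the numeric value 0x3F in every code unit width, whereas a character beyond ASCII does not reduce to a single 7-bit combination.

```python
def is_beyond_ascii(cp: int) -> bool:
    """True for code points in U+0080..U+10FFFF (hypothetical helper)."""
    return 0x80 <= cp <= 0x10FFFF


def code_units(ch: str, encoding: str, width: int) -> list:
    """Split the encoded form of ch into integer code units of `width` bytes."""
    data = ch.encode(encoding)
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]


# (encoding name, code unit width in bytes); BE variants emit no BOM.
FORMS = (("utf-8", 1), ("utf-16-be", 2), ("utf-32-be", 4))

for ch in ("?", "\u00e9"):  # U+003F, and U+00E9 as a sample beyond ASCII
    cp = ord(ch)
    print(f"U+{cp:04X} beyond ASCII: {is_beyond_ascii(cp)}")
    for encoding, width in FORMS:
        units = [hex(u) for u in code_units(ch, encoding, width)]
        print(f"  {encoding}: {units}")
```

For U+003F the single code unit has the value 0x3F in all three forms; only the amount of zero padding around the 7 bits differs.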

Regards,

Sean
