Beyond ASCII

Sean Leonard Tue, 29 Sep 2015 23:18:07 -0700

On 9/29/2015 11:50 AM, Ken Whistler wrote:

At any rate, any formal contribution that suggests coming up with terminology for
the #1 and #2 sets should take these considerations under advisement.

The original premise of this thread was (and is) to find the *most concise* term for that range U+0080 - U+10FFFF, regardless of whether that range is for characters, code points, scalar values, or coffee cup icons ☕️. Preferably, such a concise term would have support in the Unicode Standard, or in some other standard. I was not looking for a totally new, invented term, but rather a term that has empirical, standards-based support.

A full survey of the Unicode Standard 8.0 finds that the term "beyond ASCII" has textual support:

p. 1 Introduction: While taking the ASCII character set as its starting point, the Unicode Standard goes far beyond ASCII’s limited ability [...]

p. 37 ASCII Transparency: [UTF-8] maintains transparency for all of the ASCII code points (0x00..0x7F). That means Unicode code points U+0000..U+007F are [thus] indistinguishable from ASCII itself. [...] Beyond the ASCII range of Unicode, many [...] scripts are represented by two bytes [in UTF-8...]
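As a rough illustration of the ASCII transparency described in that passage, here is a minimal Python sketch. It is not taken from the Standard; the helper name `utf8_len` and the choice of sample code points are assumptions made for this example only.

```python
# Sketch only: check that UTF-8 leaves U+0000..U+007F as single bytes
# identical to ASCII, and that code points beyond ASCII need 2 to 4 bytes.

def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 uses for the code point cp (hypothetical helper)."""
    return len(chr(cp).encode("utf-8"))

# Every code point in the ASCII range encodes to the single byte of the same value.
ascii_transparent = all(chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80))

# A few sample code points on either side of the U+007F / U+0080 boundary.
samples = [0x00, 0x41, 0x7F, 0x80, 0x3B1, 0x7FF, 0x800, 0x2615, 0x1F609]
lengths = {cp: utf8_len(cp) for cp in samples}

for cp, n in lengths.items():
    print(f"U+{cp:04X}: {n} byte(s) in UTF-8")
```

The length changes exactly at U+0080, which is the lower bound of the range this thread is trying to name.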

p. 200 Programming Languages: A limitation of the ISO/ANSI C model is its assumption that characters can always be processed in isolation. Implementations that choose to go beyond the ISO/ANSI C model may find it useful to mix widths within their APIs.

{This formulation is not "beyond ASCII", but uses the preposition "beyond" in the exact same sense, since ASCII is fixed-width and forms an underlying assumption of the ISO/ANSI C model.}

p. 237 Case Mappings: A number of complications to case mappings occur once the repertoire of characters is expanded beyond ASCII.

p. 677 Han / CJK Unified Ideographs Extension B: The ideographs in the CJK Unified Ideographs Extension B block represent an additional set of 42,711 unified ideographs beyond the 27,496 included in The Unicode Standard, Version 3.0.

{This formulation uses the preposition "beyond" in the exact same sense, namely, a subsequent range that is beyond the original range.}

Ditto for Extension C, Extension D, Extension E

Finally, (case) "beyond ASCII" is in the Index at p. 237.

Perhaps this thread would have gone differently if the original subject had been "Beyond ASCII" instead of...that other one. 😉

Now, I am not saying that the term *must* be "beyond ASCII". However, the term "non-ASCII" (with or without "Unicode") has no support in the Unicode Standard 8.0. The only occurrence is the reference to RFC 2047, and in that document, "non-ASCII" clearly means any and every character encoding ever invented, not specifically Unicode.

Another thing is the oxymoron "ASCII Unicode" (the opposite of "non-ASCII Unicode"). Actually ASCII is a formal subset of Unicode...at the beginning. ASCII itself (ANSI X3.4-1986) is a 7-bit character set; it does not limit itself to any particular word length so long as the 7 bits are in those combinations. Therefore U+0000 - U+007F characters encoded in UTF-32 or UTF-16 are in ASCII codes; they are truly ASCII characters. When a bit combination '?' (0x3F) is loaded into a 64-bit register on a CPU, is it still an ASCII character? My view is yes.

They are not in ASCII *encoding*, as *encoding* is limited to a sequence of 7-bit or 8-bit combinations (X3.4-1986 Section 2.1.1(1)). My point here is that to be correct, one ought to use some sort of preposition, namely "ASCII in Unicode" or "ASCII [characters/code points/scalar values] in Unicode"--but if you slice off "in Unicode", you are left with "ASCII" and that is just fine. This is another basis for the proposition that "beyond ASCII" (e.g., "characters beyond ASCII [in Unicode]", "beyond the ASCII range [of Unicode]") makes sense.
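To make the distinction between an ASCII *character* and the ASCII *encoding* a little more concrete, here is a small Python sketch using the '?' (0x3F) example. It is only an illustration under my own assumptions: the variable names and the `beyond_ascii` predicate are hypothetical, and big-endian encoding forms are chosen purely for readability.

```python
# Sketch only: '?' (U+003F) in three Unicode encoding forms.
ch = "?"
encoded = {
    "utf-8": ch.encode("utf-8"),        # one 8-bit code unit
    "utf-16-be": ch.encode("utf-16-be"),  # one 16-bit code unit
    "utf-32-be": ch.encode("utf-32-be"),  # one 32-bit code unit
}

# Read as an integer, every form carries the same 7-bit combination, 0x3F.
values = {name: int.from_bytes(data, "big") for name, data in encoded.items()}

# Only the single-byte form is byte-for-byte identical to the ASCII encoding.
same_bytes_as_ascii = {
    name: data == ch.encode("ascii") for name, data in encoded.items()
}

def beyond_ascii(cp: int) -> bool:
    """True for code points in U+0080..U+10FFFF (hypothetical predicate)."""
    return 0x80 <= cp <= 0x10FFFF

for name in encoded:
    print(name, encoded[name].hex(), hex(values[name]), same_bytes_as_ascii[name])
```

The value is the same in every width, which matches the view that these are still ASCII characters even when the byte sequence is not the ASCII encoding.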


Regards,

Sean
