2015-09-21 21:54 GMT+02:00 Tony Jollans <[email protected]>:
> The actual octets are, of course, used in combinations, but not singly in
> any way that requires them to be described in Unicode terms. Or am I
> missing something fundamental?
The terms you are looking for are described in the part of the standard covering the Unicode encoding forms and schemes. If you're speaking at the octet level, the proper term is "8-bit code unit"; look for the definition of "code unit", not "code point", "scalar value", or "character".

"Character" has another definition in programming languages, but Unicode is not normatively bound to any programming language, and actual storage or transport sizes are not part of the standard. You'll need to look into the technical documentation of each programming language, transport protocol, or storage device: this is out of scope of the standard itself. Each environment describes its own API, library, or adapter to interface with Unicode elements and texts or convert data correctly, sometimes with several competing interfaces or converters.

On this list we focus only on standard interchange formats, but that problem was solved long ago, notably by Internet standards and RFCs such as MIME, which has its own definition of "characters" because these standards are not exclusively bound to Unicode but also support other legacy standards. Even then, these definitions apply only at an upper layer; lower layers may use other conversions, including data compression techniques or escaping modes, and could even work with units smaller than octets or even smaller than binary bits, or multiplex some bits within a complex state representation, for example in modems spreading bits over a matrix of non-binary states with redundancy and error correction.
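To make the code unit / code point distinction concrete, here is a small Python sketch (my own illustration, not from the thread): a single code point can occupy different numbers of code units depending on the encoding form.

```python
# One code point (scalar value), counted in code units of each encoding form.
s = "\U0001F600"  # U+1F600 GRINNING FACE: a single Unicode code point

assert len(s) == 1                            # one code point
assert len(s.encode("utf-8")) == 4            # four 8-bit code units
assert len(s.encode("utf-16-le")) // 2 == 2   # two 16-bit code units (a surrogate pair)
assert len(s.encode("utf-32-le")) // 4 == 1   # one 32-bit code unit
```

The same scalar value is thus one, two, or four code units depending on the encoding form, which is why "character" and "octet" are too ambiguous without naming the layer.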
Even the order of bits is not defined in the Unicode standard or in the internal lower layers of an interface (these are not the layers concerned with interchange across a large network; they are specific to each physical or virtual link between specific pairs of hosts, buses/cables, hubs, switches, or routers, and at this level they do not even have to know whether the data actually contains text or which upper-layer encoding forms are used or implied).

So let's get back to your focus: you're wondering if there's a term for octets with the high bit set, in the context of texts processed with some standard Unicode algorithms.

- We have a term for 16-bit code units used in combination to encode a single code point: these are "surrogates".
- For 8-bit code units, there are at least three encodings described: UTF-8, CESU-8, and SCSU. Each one has its own subranges of octet values that are processed differently. The best way to name these ranges is to look into the standard documentation of these encoding schemes.

These definitions are independent of those used in other encoding schemes/forms (including those defined by TUS); they do not operate at the same level, and these independent levels should (must?) be black-boxed: their scope is strongly delimited and transparent to all other layers of processing, and each layer is replaceable by a competing encoding.

Note that initially even TUS did not define any encoding scheme below the level of code points and their scalar values. There was then no concept of "code units"; these were standardized only when a few encoding schemes (the UTFs) were integrated in a standard annex, and then directly in TUS itself, as they became ubiquitous for handling Unicode texts and outweighed all other (older) legacy standards (including Internet standards, which still survive with their mandatory or optional support of legacy standards: UTF-8 proved to be the easiest encoding offering a basic level of compatibility with these older standards).
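As an illustration of those UTF-8 subranges (the helper name below is mine; the ranges are the standard UTF-8 lead/trailing byte classes, so every octet with the high bit set is either a lead byte, a continuation byte, or ill-formed):

```python
def classify_utf8_octet(b: int) -> str:
    """Classify a single 8-bit code unit by its role in well-formed UTF-8."""
    if b <= 0x7F:
        return "ASCII (single-unit sequence)"
    if 0x80 <= b <= 0xBF:
        return "continuation (trailing) byte"
    if 0xC2 <= b <= 0xDF:
        return "lead byte of a 2-unit sequence"
    if 0xE0 <= b <= 0xEF:
        return "lead byte of a 3-unit sequence"
    if 0xF0 <= b <= 0xF4:
        return "lead byte of a 4-unit sequence"
    return "invalid in well-formed UTF-8"  # 0xC0, 0xC1, 0xF5-0xFF

# U+00E9 LATIN SMALL LETTER E WITH ACUTE encodes as 0xC3 0xA9:
for b in "\u00e9".encode("utf-8"):
    print(hex(b), "->", classify_utf8_octet(b))
```

Note that CESU-8 and SCSU assign different meanings to some of these same octet values, which is exactly why each scheme's own documentation is the right place to name its ranges.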

