No, 0xF8..0xFF are not used at all in UTF-8; but U+00F8..U+00FF really **do** have UTF-8 encodings (using two bytes each).
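A quick illustration of the distinction (a minimal Python sketch): the *code points* U+00F8..U+00FF each encode to two bytes, while the *byte values* 0xF8..0xFF never occur in any well-formed UTF-8 sequence.

```python
# Code points U+00F8..U+00FF have valid two-byte UTF-8 encodings...
for cp in range(0xF8, 0x100):
    encoded = chr(cp).encode("utf-8")
    assert len(encoded) == 2          # e.g. U+00F8 -> 0xC3 0xB8

# ...but the byte values 0xF8..0xFF themselves never appear in valid UTF-8:
for b in range(0xF8, 0x100):
    try:
        bytes([b]).decode("utf-8")
        raise AssertionError("should not decode")
    except UnicodeDecodeError:
        pass  # expected: these are not well-formed UTF-8 bytes
```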
The only safe way to represent arbitrary bytes within strings, when they are not valid UTF-8, is to use invalid UTF-8 sequences, i.e. a "UTF-8-like" private extension of UTF-8 (that extension is still not UTF-8!). This is what Java does to represent U+0000 as (0xC0, 0x80) in compiled bytecode, and in the C/C++ interface for JNI when converting a Java string buffer into a C/C++ string terminated by a NULL byte (the terminator is not part of the Java string content itself). That special sequence is nonetheless exposed in the Java API as a true unsigned 16-bit code unit (char) with value 0x0000, and a valid single code point.

The same can be done for re-encoding each invalid byte of a non-UTF-8-conforming text, using a "UTF-8-like" scheme (still compatible with plain UTF-8 for every valid UTF-8 text). You may either:

* (a) encode each invalid byte separately (using two bytes for each), or encode the invalid bytes in groups of 3 bits (represented using the bytes 0xF8..0xFF), which then needs 3 bytes per group; or
* (b) encode a private starter (e.g. 0xFF), followed by a byte giving the length of the raw byte sequence that follows, and then that raw byte sequence itself without any re-encoding. This will never be confused with other valid code points; however, this scheme may no longer be directly indexable from arbitrary random positions, unlike scheme (a), which may be marginally longer.

Both schemes (a) and (b) would be useful in editors that allow editing arbitrary binary files as if they were plain text, even if they contain null bytes or invalid UTF-8 sequences (it's up to these editors to find a way to distinctively represent these bytes, and a way to enter/change them reliably).
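Scheme (b) could be sketched as follows (a minimal illustration in Python; the function names and the choice to frame each invalid byte individually are my own — only the 0xFF private starter and the length-prefixed framing come from the description above):

```python
def encode_scheme_b(raw: bytes) -> bytes:
    """Re-encode arbitrary bytes so the result embeds in a UTF-8 stream:
    valid UTF-8 sequences pass through unchanged; each invalid byte is
    wrapped as 0xFF (private starter) + length byte + raw bytes."""
    out = bytearray()
    i = 0
    while i < len(raw):
        # Try to decode a valid UTF-8 sequence at position i, longest first.
        for n in (4, 3, 2, 1):
            chunk = raw[i:i + n]
            try:
                chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            out += chunk
            i += n
            break
        else:
            # Invalid byte: frame it as starter + length + raw payload.
            # (This sketch frames each invalid byte alone, length = 1.)
            out += bytes([0xFF, 1, raw[i]])
            i += 1
    return bytes(out)


def decode_scheme_b(enc: bytes) -> bytes:
    """Recover the original bytes from the scheme-(b) encoding."""
    out = bytearray()
    i = 0
    while i < len(enc):
        if enc[i] == 0xFF:                  # private starter: never valid UTF-8
            n = enc[i + 1]                  # length of the raw payload
            out += enc[i + 2:i + 2 + n]     # copied back without re-encoding
            i += 2 + n
        else:
            out.append(enc[i])
            i += 1
    return bytes(out)
```

Since 0xFF can never appear in well-formed UTF-8, the starter is unambiguous, and any valid UTF-8 text passes through both functions unchanged.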
There's also a possible extension if the backing store uses UTF-16. All code units 0x0000..0xFFFF are already used, but one scheme is possible using unpaired surrogates (notably a low surrogate NOT preceded by a high surrogate: the low surrogate already has 10 useful bits, so any raw byte value can be stored in its lowest bits). This scheme allows indexing from random positions and reliable sequential traversal in both directions (backward or forward)...

But the presence of such an extension of UTF-16 means that all the implementation code handling standard text has to detect unpaired surrogates, and can no longer assume that a low surrogate necessarily has a high surrogate encoded just before it: this must be tested, and that previous position may lie before the buffer start, causing a possible buffer overrun in the backward direction (so the code also needs to know the start position of the buffer and check it, or know the index, which cannot be negative), possibly exposing unrelated data and creating security risks — unless the backing store always adds a leading "guard" code unit arbitrarily set to 0x0000.

On Wed, Sep 12, 2018 at 12:48 AM, J Decker via Unicode <unicode@unicode.org> wrote:

> On Tue, Sep 11, 2018 at 3:15 PM Hans Åberg via Unicode <unicode@unicode.org> wrote:
>
>> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <unicode@unicode.org> wrote:
>> >
>> > On Tue, 11 Sep 2018 21:10:03 +0200
>> > Hans Åberg via Unicode <unicode@unicode.org> wrote:
>> >
>> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using
>> >> LaTeX files with sections in different Cyrillic and Latin encodings,
>> >> changing the editor encoding while typing.
>> >
>> > Rather like some of the old Unicode list archives, which are just
>> > concatenations of a month's emails, with all sorts of 8-bit encodings
>> > and stretches of base64.
>>
>> It might be useful to represent non-UTF-8 bytes as Unicode code points.
>> One way might be to use a codepoint to indicate high bit set followed by
>> the byte value with its high bit set to 0, that is, truncated into the
>> ASCII range. For example, U+0080 looks like it is not in use, though I
>> could not verify this.
>>
>
> it's used for character 0x400. 0xD0 0x80 or 0x8000 0xE8 0x80 0x80
> (I'm probably off a bit in the leading byte)
> UTF-8 can represent from 0 to 0x200000 every value; (which is all defined
> codepoints) early varients can support up to U+7FFFFFFF...
> and there's enough bits to carry the pattern forward to support 36 bits or
> 42 bits... (the last one breaking the standard a bit by allowing a byte
> wihout one bit off... 0xFF would be the leadin)
>
> 0xF8-FF are unused byte values; but those can all be encoded into utf-8.
>
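As an aside, the unpaired-low-surrogate scheme described near the top of this message is essentially what PEP 383 adopted for Python's `surrogateescape` error handler: each undecodable byte 0x80..0xFF is mapped to a lone low surrogate U+DC80..U+DCFF, and the mapping round-trips losslessly (a sketch, assuming Python 3):

```python
raw = b"valid \xc3\xa9 then invalid: \xff\xfe\x80"
s = raw.decode("utf-8", errors="surrogateescape")

# Each undecodable byte becomes a lone low surrogate U+DC00 + byte value:
assert s.endswith("\udcff\udcfe\udc80")

# The escaped string round-trips back to the exact original bytes:
assert s.encode("utf-8", errors="surrogateescape") == raw
```

Note the resulting string contains unpaired surrogates, so any code consuming it has to tolerate them — exactly the caveat raised above.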