Yes, but this does not mean that 0xFFFFFFFF cannot be used as a (32-bit) code unit in "32-bit strings", even if it is not a valid code point with a valid scalar value in any legacy or standard version of UTF-32.
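To make the distinction concrete, here is a minimal sketch in C, assuming a "32-bit string" is simply an array of uint32_t. The helper names is_scalar_value and is_wellformed_utf32 are hypothetical and only serve the illustration: any 32-bit value can be stored as a code unit, but only scalar values pass the UTF-32 check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A Unicode scalar value is a code point in 0..0x10FFFF
 * that is not a surrogate (0xD800..0xDFFF). */
static bool is_scalar_value(uint32_t unit)
{
    return unit <= 0x10FFFFu && !(unit >= 0xD800u && unit <= 0xDFFFu);
}

/* Hypothetical helper: a 32-bit string is well-formed UTF-32 only if
 * every one of its code units is a scalar value. The array itself can
 * hold any 32-bit code unit, including 0xFFFFFFFF. */
static bool is_wellformed_utf32(const uint32_t *units, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (!is_scalar_value(units[i])) {
            return false;
        }
    }
    return true;
}
```

With these definitions, the array {0x41, 0xFFFFFFFF} is a perfectly storable 32-bit string, but is_wellformed_utf32 rejects it.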
The limitation to 0x7FFFFFFF was certainly just there to avoid signed/unsigned differences in 32-bit integers (differences that would show up in APIs returning individual code units if the values were ever converted to larger integers such as 64-bit). It's true that in a 32-bit integer (signed or unsigned) you cannot differentiate the bit pattern 0xFFFFFFFF from -1, which is generally the value chosen in C/C++ standard libraries to represent the EOF condition returned by functions or macros like getchar(). But EOF conditions do not need to be differentiated when you are scanning positions in a buffer of 32-bit integers: instead you compare the relative index in the buffer with the buffer length, or the buffer object includes a separate method to test this condition.
Today, as programming environments move to 64-bit by default, APIs that return an integer when reading individual code positions can return it as a 64-bit integer, even if the inner storage uses 32-bit code units: 0xFFFFFFFF is then returned as a positive integer and not as the -1 used for EOF. This was not yet true when the legacy UTF-32 encoding was created, when a majority of environments were still running only 32-bit or 16-bit code; for 16-bit code, the 0xFFFF code unit, for the U+FFFF code point, had to be assigned to a noncharacter to limit confusion with the EOF condition in C/C++ or similar APIs in other languages (when they cannot throw an exception instead of returning a distinct EOF value).
Well, there are still a lot of devices running 32-bit code (notably in guest VMs and in small devices), written in C/C++ with the old standard C library but without OOP features (such as exceptions, or methods on buffer objects). In Java, the "int" datatype (which is 32-bit and signed) has not been extended to 64-bit, even on platforms where 64-bit integers are the internal datatype used by the JVM in its natively compiled binary code.
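The sign collision with EOF described above can be sketched in C as follows. This is only an illustration under stated assumptions: READ_EOF, read_unit_narrow and read_unit_wide are hypothetical names, not part of any standard library, and the narrowing conversion of an out-of-range value to int32_t is implementation-defined before C23 (it wraps to -1 on the usual two's-complement compilers).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical end-of-buffer sentinel, mirroring the usual EOF == -1. */
#define READ_EOF (-1)

/* Narrow reader: returns the code unit in a signed 32-bit integer.
 * The code unit 0xFFFFFFFF converts to -1 on two's-complement
 * compilers (implementation-defined before C23), so it cannot be
 * told apart from READ_EOF. */
static int32_t read_unit_narrow(const uint32_t *buf, size_t len, size_t pos)
{
    if (pos >= len) {
        return READ_EOF;
    }
    return (int32_t)buf[pos];
}

/* Wide reader: returns the code unit in a signed 64-bit integer.
 * Every 32-bit code unit maps to a non-negative value, so READ_EOF
 * stays distinct even from 0xFFFFFFFF. */
static int64_t read_unit_wide(const uint32_t *buf, size_t len, size_t pos)
{
    if (pos >= len) {
        return READ_EOF;
    }
    return (int64_t)buf[pos];
}
```

A caller that compares pos against len itself never needs the sentinel at all, which is the point made above about scanning a buffer by index.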
Once again, "code units" and "x-bit strings" are not bound to any Unicode, ISO/IEC 10646 or legacy RFC constraints related to the current standard UTFs or to legacy (obsoleted) UTFs. And I still don't see any productive need for "Unicode x-bit strings" in TUS D80-D83, when all that is needed for conformance is NOT the whole range of valid code units, but only the allowed range of scalar values (for which code units only need to be defined in a large enough set of distinct values: the exact cardinality of this set does not matter, and there can always exist additional valid "code units" not bound to any valid "scalar value" or to the minimal set of distinct "Unicode code units" needed to support the standard Unicode encoding forms).
Even the Unicode scalar values, or the implied values of "Unicode code units", do not have to be aligned with the effective native values of the "code units" used at the lower level... except for the standard encoding schemes for 8-bit interchange, where byte order matters... but still not the lower-level bit order or the native hardware representation of individually addressable bytes, which may sometimes be larger than 8 bits, with additional control bits or framing bits, and sometimes even with variable bit sizes depending on their relative position in transport frames!

2015-05-11 19:44 GMT+02:00 Doug Ewell <[email protected]>:

> Hans Aberg <haberg dash 1 at telia dot com> wrote:
>
> >>> However I wonder what would be the effect of D80 in UTF-32: is
> >>> <0xFFFFFFFF> a valid "32-bit string" ?
> >>
> >> The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it
> >> cannot represent a unit of encoded text in a UTF-32 string.
> >
> > Even though the values with highest bit set are not a part of original
> > UTF-32, it can easily be extended also to original UTF-8, which may be
> > simpler to implement.
>
> "Original UTF-8," regardless of where defined, only ever encoded scalar
> values up to 0x7FFFFFFF. See, for example, RFC 2279.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

