> On 11 May 2015, at 21:25, Philippe Verdy <[email protected]> wrote:
>
> Yes, but this does not mean that 0xFFFFFFFF cannot be used as a (32-bit) code
> unit in "32-bit strings", even if it is not a valid code point with a valid
> scalar value in any legacy or standard version of UTF-32.
The reason I did it was to avoid having a check to throw an exception. It
merely means that the check for valid Unicode code points, in such a context,
must be elsewhere.

> The limitation to 0x7FFFFFFF was certainly just there to avoid signed/unsigned
> differences in 32-bit integers (if ever they were in fact converted to larger
> integers such as 64-bit to exhibit differences in APIs returning individual
> code units).

Indeed, so I use uint32_t, because char can be signed at the will of the C/C++
compiler implementer.

> It's true that in 32-bit integers (signed or unsigned) you cannot
> differentiate 0xFFFFFFFF from -1 (which is generally the value chosen in C/C++
> standard libraries for representing the EOF condition returned by functions or
> macros like getchar()). But EOF conditions do not require to be differentiated
> when you are scanning positions in a buffer of 32-bit integers (instead you
> compare the relative index in the buffer with the buffer length, or the buffer
> object includes a separate method to test this condition).

That is a good point - perhaps that was the reason not to allow the highest bit
to be set. But it would not be a problem in C++, should it get UTF-32 streams,
as they can throw an exception.

> But today, where programming environments are going to 64-bit by default, the
> APIs that return an integer when reading individual code positions will
> return them as 64-bit integers, even if the inner storage uses 32-bit code
> units: 0xFFFFFFFF will then be returned as a positive integer, not the -1 used
> for EOF.

Right, the C/C++ language specifications say that size_t and friends must be
able to hold any object size, and similarly for differences, so this forces
signed and unsigned 64-bit integral types on a 64-bit platform. (A short C++
sketch further down illustrates this, together with the uint32_t point above.)

> This was still not true when the legacy UTF-32 encoding was created, when a
> majority of environments were still only running 32-bit or 16-bit code; for
> the 16-bit code, the 0xFFFF code unit, for the U+FFFF code point, had to be
> assigned to a non-character to limit problems of confusion with the EOF
> condition in C/C++ or similar APIs in other languages (when they cannot throw
> an exception instead of returning a distinct EOF value).

Right, it might be a non-issue today.

> Well, there are still a lot of devices running 32-bit code (notably in guest
> VMs, and in small devices) and written in C/C++ with the old standard C
> library, but without OOP features (such as exceptions, or methods for
> buffering objects). In Java, the "int" datatype (which is 32-bit and signed)
> has not been extended to 64-bit, even on platforms where 64-bit integers are
> the internal datatype used by the JVM in its natively compiled binary code.

Legacy is a problem.
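To make this concrete, here is a minimal C++ sketch along the lines discussed
above; the buffer32 class and its member names are purely illustrative, not
taken from any existing library. It stores raw uint32_t code units (so
0xFFFFFFFF is representable), keeps the scalar-value check in a separate
function, and shows both a 64-bit get() where -1 as EOF cannot collide with
0xFFFFFFFF, and an exception-throwing alternative.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <vector>

// 32-bit code units: any uint32_t value is storable, including 0xFFFFFFFF.
using code_unit = std::uint32_t;

// The Unicode validity check lives elsewhere and is applied only when wanted.
constexpr bool is_scalar_value(code_unit c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

class buffer32 {
    std::vector<code_unit> data_;
    std::size_t pos_ = 0;
public:
    explicit buffer32(std::vector<code_unit> d) : data_(std::move(d)) {}

    // With a 64-bit return type, -1 (EOF) cannot collide with 0xFFFFFFFF,
    // which comes back as the positive value 4294967295.
    std::int64_t get() {
        if (pos_ == data_.size())
            return -1;              // EOF decided by position, not by sentinel
        return static_cast<std::int64_t>(data_[pos_++]);
    }

    // A C++ alternative: signal EOF and invalid scalar values by exception
    // instead of an in-band value.
    code_unit get_scalar() {
        if (pos_ == data_.size())
            throw std::out_of_range("end of buffer");
        code_unit c = data_[pos_++];
        if (!is_scalar_value(c))
            throw std::range_error("not a Unicode scalar value");
        return c;
    }
};

int main() {
    buffer32 b(std::vector<code_unit>{0x41u, 0xFFFFFFFFu});
    for (std::int64_t v = b.get(); v != -1; v = b.get())
        std::printf("0x%llX\n", static_cast<unsigned long long>(v));
}

The only point is that the choice between an in-band EOF value, a position
check, and an exception is independent of which 32-bit values the code units
themselves are allowed to take.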
> Once again, "code units" and "x-bit strings" are not bound to any Unicode or
> ISO/IEC 10646 or legacy RFC constraints related to the current standard UTFs
> or legacy (obsoleted) UTFs.
>
> And I still don't see any productive need for "Unicode x-bit strings" in TUS
> D80-D83, when all that is needed for conformance is NOT the whole range of
> valid code units, but only the allowed range of scalar values (for which
> there's only the need for code units to be defined over a large enough set of
> distinct values: the exact cardinality of this set does not matter, and there
> can always exist additional valid "code units" not bound to any valid "scalar
> value" or to the minimal set of distinct "Unicode code units" needed to
> support the standard Unicode encoding forms).
>
> Even the Unicode scalar values or the implied values of "Unicode code units"
> do not have to be aligned with the effective native values of the "code units"
> used at the lower level... except for the standard encoding schemes for 8-bit
> interchanges, where byte order matters... but still not the lower-level bit
> order or the native hardware representation of individually addressable bytes,
> which may sometimes be larger than 8 bits, with some other control bits or
> framing bits, and sometimes even with variable bit sizes depending on their
> relative position in transport frames!

It is perfectly fine to consider the Unicode code points as abstract integers,
with the UTF-32 and UTF-8 encodings translating them into byte sequences in a
computer. The code points that conflict with UTF-16 might merely have been
declared not in use until UTF-16 has fallen out of use, replaced by UTF-8 and
UTF-32. One is going to check that the code points are valid Unicode values
somewhere anyway, so it is hard to see the point of restricting UTF-8 to align
it with UTF-16.
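As an illustration of treating code points as abstract integers, here is a
small C++ sketch (the function names are mine, purely for illustration) that
encodes any value up to 0x7FFFFFFF with the original 1-6 byte UTF-8 pattern,
i.e. the pre-RFC 3629 form before UTF-8 was cut down to match UTF-16's range,
and keeps the Unicode validity check as a separate step.

#include <cstdint>
#include <cstdio>
#include <vector>

// Validity against the current Unicode definition, kept as a separate step.
constexpr bool is_scalar_value(std::uint32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Encode any value up to 0x7FFFFFFF with the original 1-6 byte UTF-8 pattern.
std::vector<std::uint8_t> encode_utf8_31bit(std::uint32_t c) {
    std::vector<std::uint8_t> out;
    if (c <= 0x7F) {
        out.push_back(static_cast<std::uint8_t>(c));
        return out;
    }
    int n =  c <= 0x7FF     ? 2
           : c <= 0xFFFF    ? 3
           : c <= 0x1FFFFF  ? 4
           : c <= 0x3FFFFFF ? 5
           :                  6;    // up to 0x7FFFFFFF
    // Lead byte: n high bits set, a zero bit, then the top payload bits.
    out.push_back(static_cast<std::uint8_t>(
        (0xFF << (8 - n)) | (c >> (6 * (n - 1)))));
    // Continuation bytes: 10xxxxxx, six payload bits each.
    for (int i = n - 2; i >= 0; --i)
        out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> (6 * i)) & 0x3F)));
    return out;
}

int main() {
    for (std::uint32_t c : {0x41u, 0x10FFFFu, 0x110000u, 0x7FFFFFFFu}) {
        std::printf("0x%X (%s):", c,
                    is_scalar_value(c) ? "valid scalar value" : "outside Unicode");
        for (std::uint8_t b : encode_utf8_31bit(c))
            std::printf(" %02X", static_cast<unsigned>(b));
        std::printf("\n");
    }
}

On the sample values, 0x110000 and 0x7FFFFFFF encode without difficulty; only
the separate is_scalar_value() check reports them as outside the current
Unicode range, which is exactly the division of labour argued for above.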

