> On 11 May 2015, at 21:25, Philippe Verdy <[email protected]> wrote:
>
> Yes, but this does not mean that 0xFFFFFFFF cannot be used as a (32-bit) code
> unit in "32-bit strings", even if it is not a valid code point with a valid
> scalar value in any legacy or standard version of UTF-32.
The reason I did it was to avoid having a check to throw an exception. It
merely means that the check for valid Unicode code points, in such a context,
must be elsewhere.

> The limitation to 0x7FFFFFFF was certainly just there to avoid signed/unsigned
> differences in 32-bit integers (if ever they were in fact converted to larger
> integers such as 64-bit to exhibit differences in APIs returning individual
> code units).

Indeed, so I use uint32_t, because char can be signed at the will of the C/C++
compiler implementer.

> It's true that in 32-bit integers (signed or unsigned) you cannot
> differentiate 0xFFFFFFFF from -1 (which is generally the value chosen in C/C++
> standard libraries for representing the EOF condition returned by functions or
> macros like getchar()). But EOF conditions do not require to be differentiated
> when you are scanning positions in a buffer of 32-bit integers (instead you
> compare the relative index in the buffer with the buffer length, or the buffer
> object includes a separate method to test this condition).

That is a good point - perhaps that was the reason not to allow the highest bit
to be set. But it would not be a problem in C++, should it get UTF-32 streams,
as they can throw an exception.

> But today, where programming environments are going to 64-bit by default, the
> APIs that return an integer when reading individual code positions will
> return them as 64-bit integers, even if the inner storage uses 32-bit code
> units: 0xFFFFFFFF will then be returned as a positive integer, not the -1 used
> for EOF.

Right, the C/C++ language specifications say that size_t and friends must be
able to hold any object size, and similarly for differences, so this forces
signed and unsigned 64-bit integral types on a 64-bit platform. (A short C++
sketch further down illustrates this, together with the uint32_t point above.)

> This was still not true when the legacy UTF-32 encoding was created, when a
> majority of environments were still only running 32-bit or 16-bit code; for
> the 16-bit code, the 0xFFFF code unit, for the U+FFFF code point, had to be
> assigned to a non-character to limit problems of confusion with the EOF
> condition in C/C++ or similar APIs in other languages (when they cannot throw
> an exception instead of returning a distinct EOF value).

Right, it might be a non-issue today.

> Well, there are still a lot of devices running 32-bit code (notably in guest
> VMs, and in small devices) and written in C/C++ with the old standard C
> library, but without OOP features (such as exceptions, or methods for
> buffering objects). In Java, the "int" datatype (which is 32-bit and signed)
> has not been extended to 64-bit, even on platforms where 64-bit integers are
> the internal datatype used by the JVM in its natively compiled binary code.

Legacy is a problem.
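To make this concrete, here is a minimal C++ sketch along the lines discussed
above; the buffer32 class and its member names are purely illustrative, not
taken from any existing library. It stores raw uint32_t code units (so
0xFFFFFFFF is representable), keeps the scalar-value check in a separate
function, and shows both a 64-bit get() where -1 as EOF cannot collide with
0xFFFFFFFF, and an exception-throwing alternative.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <vector>

// 32-bit code units: any uint32_t value is storable, including 0xFFFFFFFF.
using code_unit = std::uint32_t;

// The Unicode validity check lives elsewhere and is applied only when wanted.
constexpr bool is_scalar_value(code_unit c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

class buffer32 {
    std::vector<code_unit> data_;
    std::size_t pos_ = 0;
public:
    explicit buffer32(std::vector<code_unit> d) : data_(std::move(d)) {}

    // With a 64-bit return type, -1 (EOF) cannot collide with 0xFFFFFFFF,
    // which comes back as the positive value 4294967295.
    std::int64_t get() {
        if (pos_ == data_.size())
            return -1;              // EOF decided by position, not by sentinel
        return static_cast<std::int64_t>(data_[pos_++]);
    }

    // A C++ alternative: signal EOF and invalid scalar values by exception
    // instead of an in-band value.
    code_unit get_scalar() {
        if (pos_ == data_.size())
            throw std::out_of_range("end of buffer");
        code_unit c = data_[pos_++];
        if (!is_scalar_value(c))
            throw std::range_error("not a Unicode scalar value");
        return c;
    }
};

int main() {
    buffer32 b(std::vector<code_unit>{0x41u, 0xFFFFFFFFu});
    for (std::int64_t v = b.get(); v != -1; v = b.get())
        std::printf("0x%llX\n", static_cast<unsigned long long>(v));
}

The only point is that the choice between an in-band EOF value, a position
check, and an exception is independent of which 32-bit values the code units
themselves are allowed to take.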
> Once again, "code units" and "x-bit strings" are not bound to any Unicode or
> ISO/IEC 10646 or legacy RFC constraints related to the current standard UTFs
> or legacy (obsoleted) UTFs.
>
> And I still don't see any productive need for "Unicode x-bit strings" in TUS
> D80-D83, when all that is needed for conformance is NOT the whole range of
> valid code units, but only the allowed range of scalar values (for which
> there's only the need for code units to be defined over a large enough set of
> distinct values: the exact cardinality of this set does not matter, and there
> can always exist additional valid "code units" not bound to any valid "scalar
> value" or to the minimal set of distinct "Unicode code units" needed to
> support the standard Unicode encoding forms).
>
> Even the Unicode scalar values or the implied values of "Unicode code units"
> do not have to be aligned with the effective native values of the "code units"
> used at the lower level... except for the standard encoding schemes for 8-bit
> interchanges, where byte order matters... but still not the lower-level bit
> order or the native hardware representation of individually addressable bytes,
> which may sometimes be larger than 8 bits, with some other control bits or
> framing bits, and sometimes even with variable bit sizes depending on their
> relative position in transport frames!

It is perfectly fine to consider the Unicode code points as abstract integers,
with the UTF-32 and UTF-8 encodings translating them into byte sequences in a
computer. The code points that conflict with UTF-16 might merely have been
declared not in use until UTF-16 has fallen out of use, replaced by UTF-8 and
UTF-32. One is going to check that the code points are valid Unicode values
somewhere anyway, so it is hard to see the point of restricting UTF-8 to align
it with UTF-16.
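As an illustration of treating code points as abstract integers, here is a
small C++ sketch (the function names are mine, purely for illustration) that
encodes any value up to 0x7FFFFFFF with the original 1-6 byte UTF-8 pattern,
i.e. the pre-RFC 3629 form before UTF-8 was cut down to match UTF-16's range,
and keeps the Unicode validity check as a separate step.

#include <cstdint>
#include <cstdio>
#include <vector>

// Validity against the current Unicode definition, kept as a separate step.
constexpr bool is_scalar_value(std::uint32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Encode any value up to 0x7FFFFFFF with the original 1-6 byte UTF-8 pattern.
std::vector<std::uint8_t> encode_utf8_31bit(std::uint32_t c) {
    std::vector<std::uint8_t> out;
    if (c <= 0x7F) {
        out.push_back(static_cast<std::uint8_t>(c));
        return out;
    }
    int n =  c <= 0x7FF     ? 2
           : c <= 0xFFFF    ? 3
           : c <= 0x1FFFFF  ? 4
           : c <= 0x3FFFFFF ? 5
           :                  6;    // up to 0x7FFFFFFF
    // Lead byte: n high bits set, a zero bit, then the top payload bits.
    out.push_back(static_cast<std::uint8_t>(
        (0xFF << (8 - n)) | (c >> (6 * (n - 1)))));
    // Continuation bytes: 10xxxxxx, six payload bits each.
    for (int i = n - 2; i >= 0; --i)
        out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> (6 * i)) & 0x3F)));
    return out;
}

int main() {
    for (std::uint32_t c : {0x41u, 0x10FFFFu, 0x110000u, 0x7FFFFFFFu}) {
        std::printf("0x%X (%s):", c,
                    is_scalar_value(c) ? "valid scalar value" : "outside Unicode");
        for (std::uint8_t b : encode_utf8_31bit(c))
            std::printf(" %02X", static_cast<unsigned>(b));
        std::printf("\n");
    }
}

On the sample values, 0x110000 and 0x7FFFFFFF encode without difficulty; only
the separate is_scalar_value() check reports them as outside the current
Unicode range, which is exactly the division of labour argued for above.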

