On 16/12/2004 13:20, Lars Kristan wrote:
... > ... So there is a mathematically proved inconsistency in your requirements.
This only proves that requirements cannot be met by a single conversion pair. If they could be met, then such a conversion could be used immediately for converting to and from UTF-8.
However, requirements 1 and 2 are actually taken from Unicode standard, they are not my requirements.
Well, let's clarify. The existing situation is:
1. for all valid UTF-8 strings s8, UTF-16(s8) is a valid UTF-16 string and UTF-8(UTF-16(s8)) = s8
2. for all valid UTF-16 strings s16, UTF-8(s16) is a valid UTF-8 string and UTF-16(UTF-8(s16)) = s16
These standard definitions of UTF-8 and UTF-16 will not be changed, so don't even think about asking for this.
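As a minimal sketch of properties 1 and 2, here is the round trip in Python. The function names are hypothetical, and the choice of UTF-16LE without a BOM is an assumption made only to keep the byte strings comparable:

```python
def utf16_from_utf8(s8: bytes) -> bytes:
    """Standard conversion; raises UnicodeDecodeError on invalid UTF-8."""
    return s8.decode("utf-8").encode("utf-16-le")


def utf8_from_utf16(s16: bytes) -> bytes:
    """Standard conversion; raises UnicodeDecodeError on invalid UTF-16."""
    return s16.decode("utf-16-le").encode("utf-8")


# A valid UTF-8 string with a BMP character and a supplementary character.
s8 = "na\u00efve \U0001F600".encode("utf-8")
s16 = utf16_from_utf8(s8)

round_trip_8 = utf8_from_utf16(s16) == s8                     # property 1
round_trip_16 = utf16_from_utf8(utf8_from_utf16(s16)) == s16  # property 2
```

Both conversions are strict: an invalid input such as the single byte 0xFF is rejected rather than converted.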
Your requirement is a pair of functions f and g, such that:
3. for all valid UTF-8 strings s8, f(s8) = UTF-16(s8)
4. for all valid UTF-8 strings s8, g(f(s8)) = s8
5. for all INVALID UTF-8 strings t8, f(t8) is a valid UTF-16 string and g(f(t8)) = t8
The following is apparently NOT a requirement:
6. for all valid UTF-16 strings s16, g(s16) = UTF-8(s16)
But note the following logical chain, which holds for all valid UTF-16 strings s16:
2 => s16 = UTF-16(UTF-8(s16))
3 => s16 = f(UTF-8(s16))
2 => UTF-8(s16) is a valid UTF-8 string, hence by 4 f(UTF-8(s16)) can be operated on by g
=> g(s16) = g(f(UTF-8(s16)))
substituting UTF-8(s16) for s8:
4 => g(s16) = UTF-8(s16)
which proves 6.
Hence the non-requirement is in fact a logical consequence of the requirements, and that is without even looking at requirement 5.
Therefore 5 implies a contradiction. For any invalid UTF-8 string t8:
5 => f(t8) is a valid UTF-16 string
2 => UTF-8(f(t8)) is a valid UTF-8 string
6 => g(f(t8)) (= UTF-8(f(t8))) is a valid UTF-8 string
5 => t8 (= g(f(t8))) is a valid UTF-8 string
But this contradicts the premise that t8 is an invalid UTF-8 string.
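The same contradiction can be checked by brute force on a toy model. This is only a sketch: the single-letter strings are hypothetical stand-ins for the real (infinite) sets, and the one property carried over from the real case is that the standard conversion already maps the valid UTF-8 strings onto all of the valid UTF-16 strings:

```python
from itertools import product

VALID_8 = ["a", "b"]          # stand-ins for valid UTF-8 strings
INVALID_8 = ["x"]             # stand-in for an invalid UTF-8 string
VALID_16 = ["A", "B"]         # stand-ins for valid UTF-16 strings
UTF16 = {"a": "A", "b": "B"}  # the standard, bijective conversion


def satisfying_pairs():
    """Enumerate every (f, g) that meets requirements 3, 4 and 5."""
    all_8 = VALID_8 + INVALID_8
    found = []
    # Requirement 3 fixes f on valid input; requirement 5 only demands
    # that f send invalid input to some valid UTF-16 string.
    for images in product(VALID_16, repeat=len(INVALID_8)):
        f = dict(UTF16)
        f.update(zip(INVALID_8, images))
        # g needs to be defined only on valid UTF-16 strings here.
        for outputs in product(all_8, repeat=len(VALID_16)):
            g = dict(zip(VALID_16, outputs))
            if all(g[f[s]] == s for s in all_8):  # requirements 4 and 5
                found.append((f, g))
    return found


pairs = satisfying_pairs()
```

In this model `pairs` comes out empty: whichever valid UTF-16 string f assigns to "x" is already the image of a valid string, so g would have to return two different results for it.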
How's that? Well, they are my requirements also, but instead of "for all valid UTF-x strings", in my case the requirement is relaxed to "for all valid UTF-8 strings that do not contain the 128 replacement codepoints".
So do you mean to relax the requirement "for all valid UTF-8 strings s8, f(s8) = UTF-16(s8)"? The problem with this is that it is broken by existing filenames which (probably by chance) form the UTF-8 for one of your 128 replacement codepoints. Well, there are not 128 replacement codepoints, and never will be, certainly not in the BMP - unless you are talking about unpaired surrogates or the PUA.
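To make the collision concrete, here is a hedged sketch. The escape range U+E080..U+E0FF in the PUA is purely an assumption for illustration (the actual 128 codepoints proposed are not specified here), and f and g are hypothetical implementations built on Python's `surrogateescape` error handler:

```python
# Hypothetical escape scheme: invalid byte 0xNN <-> PUA code point U+E0NN.
TO_PUA = {0xDC80 + i: 0xE080 + i for i in range(128)}
FROM_PUA = {0xE080 + i: 0xDC80 + i for i in range(128)}


def f(s8: bytes) -> bytes:
    """UTF-8 (possibly invalid) -> valid UTF-16LE, escaping bad bytes."""
    text = s8.decode("utf-8", "surrogateescape")  # bad byte -> lone surrogate
    return text.translate(TO_PUA).encode("utf-16-le")


def g(s16: bytes) -> bytes:
    """Inverse of f on escaped input: PUA escapes become raw bytes again."""
    text = s16.decode("utf-16-le").translate(FROM_PUA)
    return text.encode("utf-8", "surrogateescape")


invalid = b"\xff"                  # not valid UTF-8
valid = "\ue0ff".encode("utf-8")   # valid UTF-8 for U+E0FF: EE 83 BF

req5_ok = g(f(invalid)) == invalid                              # holds
req3_ok = f(valid) == valid.decode("utf-8").encode("utf-16-le")  # holds
req4_ok = g(f(valid)) == valid                                  # fails
collision = f(invalid) == f(valid)                              # f is not injective
```

Under these assumptions a filename that happens to contain the valid sequence EE 83 BF comes back from the round trip as the single byte FF, which is the breakage described above.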
...
No, this is the most important requirement. The idea is to obtain a VALID UTF-16 string. ...
Well, your requirements are logically contradictory. Sorry.
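For comparison, a sketch of what happens when the validity requirement is the one that is dropped. Python's `surrogateescape` error handler round-trips arbitrary bytes, but it does so with lone surrogates, so the intermediate string cannot be encoded as valid UTF-16. The helper name below is hypothetical:

```python
def encodes_as_valid_utf16(text: str) -> bool:
    """True if the strict UTF-16 encoder accepts the string."""
    try:
        text.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


t8 = b"abc\xff"                                  # invalid UTF-8
escaped = t8.decode("utf-8", "surrogateescape")  # 'abc' + lone U+DCFF

lossless = escaped.encode("utf-8", "surrogateescape") == t8
still_valid = encodes_as_valid_utf16(escaped)
```

Here the round trip is lossless, but `still_valid` is false: the price of preserving the invalid byte is that the result is no longer a valid UTF-16 string.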
--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

