On 16/12/2004 13:20, Lars Kristan wrote:

...
> ... So there is a
> mathematically proved
> inconsistency in your requirements.

This only proves that requirements cannot be met by a single conversion pair. If they could be met, then such a conversion could be used immediately for converting to and from UTF-8.

However, requirements 1 and 2 are actually taken from Unicode standard, they are not my requirements.


Well, let's clarify. The existing situation is:

1. for all valid UTF-8 strings s8, UTF-16(s8) is a valid UTF-16 string and UTF-8(UTF-16(s8)) = s8
2. for all valid UTF-16 strings s16, UTF-8(s16) is a valid UTF-8 string and UTF-16(UTF-8(s16)) = s16
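Properties 1 and 2 can be spot-checked with a short Python sketch. The function names `utf16_from_utf8` and `utf8_from_utf16` are my own labels for the UTF-16( ) and UTF-8( ) conversions above, and little-endian UTF-16 without a BOM is an arbitrary choice made only for the illustration. A handful of samples proves nothing in general, of course; it only shows what the two properties say.

```python
def utf16_from_utf8(s8: bytes) -> bytes:
    """UTF-16(s8): strict decoding, so invalid UTF-8 raises UnicodeDecodeError."""
    return s8.decode("utf-8").encode("utf-16-le")


def utf8_from_utf16(s16: bytes) -> bytes:
    """UTF-8(s16): strict decoding, so invalid UTF-16 raises UnicodeDecodeError."""
    return s16.decode("utf-16-le").encode("utf-8")


# ASCII, a two-byte character, a three-byte character, and one outside the BMP.
samples = ["abc", "h\u00e9llo", "\u20ac", "\U0001d11e"]

for text in samples:
    s8 = text.encode("utf-8")
    s16 = text.encode("utf-16-le")
    # Property 1: UTF-8(UTF-16(s8)) = s8
    assert utf8_from_utf16(utf16_from_utf8(s8)) == s8
    # Property 2: UTF-16(UTF-8(s16)) = s16
    assert utf16_from_utf8(utf8_from_utf16(s16)) == s16
```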


These standard definitions of UTF-8 and UTF-16 will not be changed, so don't even think about asking for this.

Your requirement is a pair of functions f and g, such that:

3. for all valid UTF-8 strings s8, f(s8) = UTF-16(s8)
4. for all valid UTF-8 strings s8, g(f(s8)) = s8
5. for all INVALID UTF-8 strings t8, f(t8) is a valid UTF-16 string and g(f(t8)) = t8


The following is apparently NOT a requirement:

6. for all valid UTF-16 strings s16, g(s16) = UTF-8(s16)

But note the following logical chain, which holds for all valid UTF-16 strings s16:

2 => s16 = UTF-16(UTF-8(s16))
3 => s16 = f(UTF-8(s16))
2 => UTF-8(s16) is a valid UTF-8 string, hence by 4 f(UTF-8(s16)) can be operated on by g
=> g(s16) = g(f(UTF-8(s16)))
substituting UTF-8(s16) for s8:
4 => g(s16) = UTF-8(s16)
which proves 6.


Hence the non-requirement is in fact a logical consequence of the requirements, and that is without even looking at requirement 5.

Therefore 5 implies a contradiction. For any invalid UTF-8 string t8:

5 => f(t8) is a valid UTF-16 string
2 => UTF-8(f(t8)) is a valid UTF-8 string
6 => g(f(t8)) (= UTF-8(f(t8)) ) is a valid UTF-8 string
5 => t8 (= g(f(t8)) ) is a valid UTF-8 string

But this contradicts the premise that t8 is an invalid UTF-8 string.

How's that? Well, they are my requirements also, but instead of "for all valid UTF-x strings", in my case the requirement is relaxed to "for all valid UTF-8 strings that do not contain the 128 replacement codepoints".


So do you mean to relax the requirement "for all valid UTF-8 strings s8, f(s8) = UTF-16(s8)"? The problem with this is that it is broken by existing filenames which (probably by chance) form the UTF-8 for one of your 128 replacement codepoints. Well, there are not 128 replacement codepoints, and never will be, certainly not in the BMP - unless you are talking about unpaired surrogates or the PUA.
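The collision can be made concrete with a small Python sketch. Everything specific in it is an assumption for illustration only: the mapping of each invalid byte 0x80+n to the PUA codepoint U+E000+n, the handler name `pua_escape_sketch`, and the function name `f` are hypothetical choices of mine, not the scheme actually proposed. Any other choice of valid replacement codepoints gives the same kind of collision.

```python
import codecs

PUA_BASE = 0xE000  # hypothetical home for the 128 replacement codepoints


def _escape_to_pua(err):
    """Error handler: map each undecodable byte 0x80+n to U+E000+n."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(PUA_BASE + b - 0x80) for b in bad), err.end


codecs.register_error("pua_escape_sketch", _escape_to_pua)


def f(b8: bytes) -> bytes:
    """Sketch of f: standard UTF-16 for valid input, PUA escapes otherwise."""
    return b8.decode("utf-8", "pua_escape_sketch").encode("utf-16-le")


t8 = b"\x80"                      # an invalid UTF-8 string
s8 = "\ue000".encode("utf-8")     # a valid UTF-8 string (bytes EE 80 80)

assert f(s8) == s8.decode("utf-8").encode("utf-16-le")  # requirement 3 holds
f(t8).decode("utf-16-le")         # f(t8) is valid UTF-16: strict decode succeeds
assert f(t8) == f(s8) and t8 != s8
```

Since f(t8) and f(s8) are the same UTF-16 string, no function g can return both t8 and s8 from it, so requirements 4 and 5 cannot both hold for this f.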
...


No, this is the most important requirement. The idea is to obtain a VALID UTF-16 string. ...


Well, your requirements are logically contradictory. Sorry.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



