On 16/12/2004 13:12, Arcane Jill wrote:

I don't think that that last sentence is true. f(), and its near-inverse, g(), do not claim to be UTFs, and are functions intended to be used only by one particular suite of applications. They are therefore nothing to do with Unicode or the UTC (... or even this list ! ). ...


But Lars is continuing to insist on 128 reserved characters in the BMP. That is relevant to the UTC.

He now seems to want to take them from the Yi Extensions block, and seems to be prepared to take the risk of being assassinated by the Yi, although not by other nations. Well, I don't know much about the Yi, but I did find "The Yi have long been known as fierce warriors." They are not a dead people who can't fight back against being pushed out of the BMP. And no doubt Michael Everson will also fight fiercely for the Yi Extensions block. So, be careful, Lars!

...The fact that I defined f such that f(s) == utf8decode(s) for all valid UTF-8 streams s does not change the status of f() as a purely private-use function.

These are the steps I see happening:
(1) start with an arbitrary octet stream
(2) "escape" it, using some function (which I have called f), to yield a valid UTF-8 stream.
(3) allow normal Unicode functions to round-trip this UTF-8 string through UTF-16 (one of Lars' requirements)
(4) finally, "unescape" the UTF-8 using f's inverse function (which I called g) to restore the original octet stream
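
For illustration only, steps (1) to (4) might be sketched in Python as follows. This assumes the escape described further down in this message (an invalid byte x becomes the character U+EE00 + x). The handler name "ee-escape" and the bodies of f and g are hypothetical, not anyone's actual implementation.

```python
import codecs

def _escape_invalid(err):
    # Hypothetical handler: each byte that is not part of valid UTF-8
    # becomes the character U+EE00 + byte (the assumed escape scheme).
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(0xEE00 + b) for b in bad), err.end

codecs.register_error("ee-escape", _escape_invalid)

def f(octets: bytes) -> str:
    """Step (2): escape. Equals plain UTF-8 decoding on valid input."""
    return octets.decode("utf-8", errors="ee-escape")

def g(text: str) -> bytes:
    """Step (4): unescape. Escape characters turn back into single bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - 0xEE00)
        else:
            out += ch.encode("utf-8")
    return bytes(out)

raw = b"abc\xff\x80 caf\xc3\xa9"                   # step (1): arbitrary octets
s = f(raw)                                         # step (2): escape
s16 = s.encode("utf-16-le").decode("utf-16-le")    # step (3): via UTF-16
assert s16 == s
assert g(s16) == raw                               # step (4): unescape
```

The round trip succeeds for this input because the octets contain no valid UTF-8 encoding of a character in the range U+EE80 to U+EEFF.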


The escape and unescape functions don't need to be approved by anyone. I'm not suggesting they should be part of any standard - they are merely a mechanism to ensure that step (3) will hold true.

These mechanisms, and any escape mechanism, do not meet the requirement which I codified as "for all valid UTF-8 strings s8, f(s8) = UTF-16(s8)". If this is not in fact a requirement, your mechanism can be made to work, and my logical proof against it fails. But perhaps this is what Lars means by "They don't translate as UTF-8 would to UTF-16": his reserved characters would be an exception to "for all valid UTF-8 strings s8, f(s8) = UTF-16(s8)". In principle this is a way ahead.

In what follows, I presume that this is still a requirement.

Lars's current implementation of this scheme is that his "f" "escapes" the binary octet 1bbbbbbb to 11101110 1011101b 10bbbbbb (or equivalently, byte x becomes the character U+EE00 + x). He is unhappy with this because characters in the range U+EE80 to U+EEFF might be found in real text. So you and I have, between us, suggested three alternative escaping functions, in an attempt to find an escape sequence with a vanishingly small probability of being found in real text. I'm not quite sure why Lars isn't happy with these suggestions - maybe his goal has still not been clearly stated - but either way, since nobody is proposing an amendment to UTFs, it surely isn't the business of the UTC.
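
The "or equivalently" above can be checked mechanically. The following is only a sketch in Python; the function name escape_bits is hypothetical. It builds the three-octet pattern bit by bit and compares it with the standard UTF-8 encoding of U+EE00 + x for every octet x with the high bit set.

```python
def escape_bits(x: int) -> bytes:
    """Build 11101110 1011101b 10bbbbbb from the octet 1bbbbbbb."""
    assert 0x80 <= x <= 0xFF
    b6 = (x >> 6) & 1      # the first of the seven 'b' bits
    low6 = x & 0x3F        # the remaining six 'b' bits
    return bytes([0b11101110, 0b10111010 | b6, 0b10000000 | low6])

# The bit pattern agrees with UTF-8 for U+EE00 + x across the whole range.
for x in range(0x80, 0x100):
    assert escape_bits(x) == chr(0xEE00 + x).encode("utf-8")
```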


The problem can be restated quite simply. Valid UTF-8 has a reversible one-to-one mapping to valid Unicode character sequences, and to valid UTF-16. So every valid Unicode character sequence is already the image of some valid UTF-8 string. If "f" maps an "invalid UTF-8" string to a valid Unicode character sequence, it also maps a valid UTF-8 string to that same sequence. The mapping "f" is then no longer one-to-one but many-to-one, which implies that there cannot be a reverse mapping "g" that recovers every input. Lars is rightly dissatisfied with any solution which does not guarantee reversibility.
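
A small Python sketch of such a collision, assuming the U+EE00 + x escape described earlier in this message; the handler name "ee-escape-demo" is hypothetical:

```python
import codecs

# Assumed escape, for illustration: a byte x that is not part of valid
# UTF-8 becomes the character U+EE00 + x.
codecs.register_error(
    "ee-escape-demo",
    lambda e: ("".join(chr(0xEE00 + b) for b in e.object[e.start:e.end]), e.end),
)

def f(octets: bytes) -> str:
    # Equals plain UTF-8 decoding whenever the input is valid UTF-8.
    return octets.decode("utf-8", errors="ee-escape-demo")

invalid = b"\xff"                    # not valid UTF-8
valid = "\uEEFF".encode("utf-8")     # valid UTF-8: b"\xee\xbb\xbf"

assert invalid != valid
assert f(invalid) == f(valid) == "\uEEFF"   # two inputs, one output
```

Whatever "g" returns for U+EEFF, it can restore at most one of the two inputs.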

I note that this argument applies equally to Lars' favoured solution of 128 special characters. If these are valid Unicode characters, they have a valid UTF-8 representation. Both this representation and the isolated bytes will be converted by "f" to the same Unicode characters. This means that "f" is still not one-to-one, and so is not reversible. That is, unless Lars is actually proposing a change to the standard UTF-8 mapping for these characters, which is certainly a matter for the UTC, or unless he is abandoning "for all valid UTF-8 strings s8, f(s8) = UTF-16(s8)".

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




