I wrote:
> But note that any occurrence of U+EE80 to U+EEFF in the original
> NON-UTF-8 "text" is escaped, even though these are valid Unicode
> codepoints. However, choosing U+EE80 to U+EEFF is not a problem,
> because these PUAs are very unlikely to be present in valid source
> texts in the absence of a prior PUA agreement.
To which you replied:
> And it would be no problem at all if new codepoints were assigned for this purpose.
No, it won't happen, because Unicode and ISO/IEC 10646 already state that they encode abstract characters. What you want is for Unicode to allocate a new block of 128 codepoints as non-characters.
There are enough non-characters in Unicode already, reserved specifically to allow internal uses, but not interchange.
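(For reference: there are 66 of them, U+FDD0..U+FDEF plus the last two codepoints of each of the 17 planes. A quick Python sketch, with a helper name of my own choosing:)

    # The 66 non-characters: U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF
    # in each of the 17 planes.
    def is_noncharacter(cp: int) -> bool:
        return (0xFDD0 <= cp <= 0xFDEF) or (cp & 0xFFFE) == 0xFFFE

    assert is_noncharacter(0xFFFE) and is_noncharacter(0x10FFFF)
    assert not is_noncharacter(0xEE80)  # a PUA: a valid character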
Unfortunately, all the remaining codepoints are merely unassigned, meaning that a conforming application receiving them must handle them as if they were characters. These codepoints are already valid, and the stability of the UTFs requires that they remain convertible between all UTFs (encoding forms or encoding schemes), with a unique mapping in both directions for every valid codepoint.
This means, in the end, that you want these codepoints to be recognized as characters sometimes, but not when you perform the conversion with a transfer encoding syntax (TES). A TES must also not modify the codepoints represented by an encoding scheme (or charset), and the UTFs have the additional property of giving each codepoint a single representation. (Note that SCSU is not a UTF, because it allows multiple representations of the same codepoints; it is just an encoding scheme, although it does preserve which codepoints are represented.)
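To illustrate that uniqueness property in Python (just a sketch, nothing more):

    # Every valid codepoint -- assigned, unassigned, PUA, or the very
    # last one -- maps uniquely through every UTF, in both directions:
    for cp in (0x0041, 0x0378, 0xEE80, 0x10FFFF):
        s = chr(cp)
        for utf in ("utf-8", "utf-16-le", "utf-32-le"):
            assert s.encode(utf).decode(utf) == s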
I really don't think that Unicode needs to allocate codepoints for non-characters: that would also defeat your requirement that all conforming applications accept them, since applications are not required to accept non-characters in interchange (and you already stated that you didn't want this to happen). So you are left with using only codepoints already assigned to characters.
That's where transfer encoding syntaxes do their job perfectly: they map any string of characters or non-characters to a portable string of assigned characters. They are not required to change the semantics of the transported characters, but they may transform a *character* present in the source string into a *sequence of characters* (yes, this is called "escaping").
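As a sketch of what such an escaping step can look like (the "=XXXX;" syntax below is invented purely for illustration):

    # Map a codepoint onto a sequence of assigned ASCII characters.
    def escape_cp(cp: int) -> str:
        return "={:04X};".format(cp)

    def escape(s: str, must_escape) -> str:
        return "".join(escape_cp(ord(c)) if must_escape(ord(c)) else c
                       for c in s)

    # Escape the non-character U+FFFE -- and "=" itself, so that
    # decoding stays unambiguous:
    print(escape("a\ufffe=b", lambda cp: cp in (0xFFFE, ord("="))))
    # -> a=FFFE;=003D;b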
If you want to strictly limit the cases where escaping of valid characters will happen, the best option you have in Unicode is to use the PUAs that are least likely to occur in original strings (of characters and non-characters) in the absence of an explicit agreement.
Note that a transfer encoding syntax, to be usable, requires an explicit mutual agreement allowing the conversion in either direction. Such a mutual agreement is exactly what PUAs were created for, so I don't see why you should not use them, given that all conforming Unicode applications must treat PUAs as valid characters, not as non-characters. (These applications may restrict which valid characters they accept, but then don't expect them to handle all possible internationalized plain texts.)
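To make the scheme from my quoted message concrete, here is a minimal Python sketch of such a PUA-based TES (my own illustration, not a specification). Invalid bytes are mapped to U+EE00 + byte, and the UTF-8 bytes of any pre-existing U+EE80..U+EEFF are escaped the same way (one possible reading of the quoted requirement), so the conversion is lossless in both directions:

    ESC = 0xEE00  # bytes 0x80..0xFF are escaped as U+EE80..U+EEFF

    def decode_bytes(data: bytes) -> str:
        # Valid UTF-8 passes through, except that invalid bytes and any
        # pre-existing U+EE80..U+EEFF get escaped byte by byte.
        out, i = [], 0
        while i < len(data):
            ch = None
            for n in (4, 3, 2, 1):  # longest valid UTF-8 sequence first
                chunk = data[i:i + n]
                try:
                    d = chunk.decode("utf-8")
                except UnicodeDecodeError:
                    continue
                if len(d) == 1 and not 0xEE80 <= ord(d) <= 0xEEFF:
                    ch, i = d, i + len(chunk)
                    break
            if ch is None:  # invalid byte, or a source U+EE80..U+EEFF
                out.append(chr(ESC + data[i]))
                i += 1
            else:
                out.append(ch)
        return "".join(out)

    def encode_text(s: str) -> bytes:
        out = bytearray()
        for ch in s:
            cp = ord(ch)
            if 0xEE80 <= cp <= 0xEEFF:
                out.append(cp - ESC)      # escaped PUA -> raw byte
            else:
                out += ch.encode("utf-8")
        return bytes(out)

    raw = b"ok\xff" + "\uEE80".encode("utf-8")    # junk byte + source PUA
    assert encode_text(decode_bytes(raw)) == raw  # lossless round trip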
Anyway, it does not matter if the PUAs you choose for your TES come into conflict with PUAs used by a renderer or a font: those are *other* interfaces, with their own private agreements about their usage. A renderer that does not explicitly know the status of a PUA in its source text must not interpret it as if it obeyed the same agreement as the one between the renderer and a font. Private agreements are not implicitly transferable and are not adopted automatically across distinct interfaces (that would require a negotiation protocol, and some checks in the software to decide what to do with conflicting PUAs obeying distinct agreements).
[ The PUAs present in font tables are only there to let renderers access those tables, for things like the internal conversion of source strings of codepoints into strings of more complex glyphs (such as ligatures or contextual form variants). No such PUA ever leaves the working domain of the renderer, so a renderer should treat every PUA present in a source string as an unknown/unassigned but valid character with no glyph. The renderer should then display it with an alternate form, such as a default square replacement glyph or a highlighted box showing the hex code of the PUA; or it may even "silently" omit the PUA from the rendered graphic, signaling elsewhere to the user that not all characters could be rendered: an alert dialog, a text in a status bar, a log message on the console, an audible beep, a flashing titlebar, a status indicator returned from its API, a warning drawn in the margins of the rendered document, and so on. ]
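(Purely illustrative, with invented names, the kind of fallback decision described above:)

    # How a renderer might treat a PUA found in its *source* string,
    # regardless of any PUAs used privately between it and a font:
    def is_pua(cp: int) -> bool:
        return (0xE000 <= cp <= 0xF8FF or 0xF0000 <= cp <= 0xFFFFD
                or 0x100000 <= cp <= 0x10FFFD)

    def display_form(cp: int, has_glyph) -> str:
        if is_pua(cp) or not has_glyph(cp):
            return "[U+{:04X}]".format(cp)  # e.g. a highlighted hex box
        return chr(cp)

    print(display_form(0xE000, lambda cp: True))  # -> [U+E000]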
If security is a concern, then choosing PUAs is also the best option, because the most critical systems will be prepared to handle the case of PUAs, but not the case of valid non-PUA characters, which they will let pass through by default (notably in the absence of an explicit agreement or specification of acceptable input strings). A process concerned with security may instead choose, by default, to filter out or substitute all possibly conflicting input PUAs.
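A sketch of that default policy (the policy and names are invented for illustration):

    # Substitute input PUAs by default, unless an explicit agreement
    # whitelists them:
    def sanitize(s: str, allowed: frozenset = frozenset()) -> str:
        def ok(cp: int) -> bool:
            pua = (0xE000 <= cp <= 0xF8FF or 0xF0000 <= cp <= 0xFFFFD
                   or 0x100000 <= cp <= 0x10FFFD)
            return not pua or cp in allowed
        return "".join(ch if ok(ord(ch)) else "\uFFFD" for ch in s)

    print(sanitize("ok\uE000ok"))  # PUA replaced by U+FFFD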
There are tons of existing TESes used every day in many applications, and none of them required the allocation of distinct codepoints for the encoded strings they generate. Why do you want new characters for this mapping? It is not necessary, as all the other existing TESes demonstrate...
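Base64 (RFC 4648) is one such everyday example: it maps arbitrary bytes, hence any codepoints, characters or not, onto 64 long-assigned ASCII characters, and never needed any new codepoint:

    import base64

    data = "abc\uEE80".encode("utf-8") + b"\xff"  # text plus a stray byte
    wire = base64.b64encode(data)                 # b'YWJj7rqA/w=='
    assert base64.b64decode(wire) == data         # perfectly reversible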

