Re: Unicode and end users - UTF-8B

Markus Scherer Tue, 19 Feb 2002 11:36:13 -0800

Lars Kristan wrote:

> ...
> The same thing should work the other way around, store Windows filenames
> directly into a UTF-16 database and use UTF-8 => UTF-16 conversion for UNIX
> filenames. Hoping that some day most of the data will be UTF-8 makes this
> even more appealing. As for any data that is not - well, the original byte
> sequence can be reconstructed and a re-conversion can be done based on
> user's settings (or selection) at display time. All you need is UTF-8B
> conversion instead of UTF-8.



I have seen this technique before! :-)

EBCDIC databases have long (20 years?) had the notion of "roundtrip conversions" for 
interoperability with ASCII codepages.
They did not formally create new codepages (as UTF-8B would be a new encoding) but 
just "abused" the normal EBCDIC codepage by using a special mapping table.

Such a special "roundtrip" mapping table is a full permutation of an originator 
codepage (ASCII-based) onto the database codepage (EBCDIC family).
This works best with 8-bit single-byte codepages on both ends; otherwise the 
originator codepage must have no more valid codes than the database one. (This used 
to be the case because few ASCII-based codepages came close to the 36000-some codes 
that EBCDIC-stateful codepages could express.)

As a full permutation, characters are mapped faithfully if they exist in both 
codepages, but other characters' codes are mapped arbitrarily and _reversibly_. So a 
(TM) symbol in an ASCII-family codepage may (for example) be mapped to a Delete 
control in the EBCDIC-family database codepage; it's preserved because when the client 
retrieves data, the Delete control gets mapped back to (TM).
This is apparently like UTF-8B, where the roundtrip of arbitrary bytes through UTF-8B 
and back preserves the original bytes.
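
To make the permutation idea concrete, here is a minimal Python sketch. It is not how 
any particular database product builds its tables; the function names are made up for 
illustration, and cp1252 and cp037 are just convenient stand-ins for an ASCII-family 
client codepage and an EBCDIC-family database codepage. It assumes both codepages are 
single-byte and that neither assigns the same character to two different bytes.

```python
def decode_table(codec):
    """Map each byte value to its character, or None if the codec leaves it undefined."""
    table = []
    for b in range(256):
        try:
            table.append(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            table.append(None)
    return table


def build_roundtrip_table(originator, database):
    """Build a 256-entry permutation from originator bytes to database bytes."""
    db_byte_for = {ch: b for b, ch in enumerate(decode_table(database))
                   if ch is not None}
    forward = [None] * 256
    # Characters present in both codepages are mapped faithfully.
    for b, ch in enumerate(decode_table(originator)):
        if ch is not None and ch in db_byte_for:
            forward[b] = db_byte_for[ch]
    # Everything else is paired off arbitrarily (here: in ascending order),
    # which is what keeps the mapping reversible.
    used = {d for d in forward if d is not None}
    spare = [d for d in range(256) if d not in used]
    unmatched = [b for b in range(256) if forward[b] is None]
    for b, d in zip(unmatched, spare):
        forward[b] = d
    assert sorted(forward) == list(range(256)), "not a full permutation"
    return bytes(forward)


def invert(table):
    """Return the inverse of a 256-entry permutation table."""
    inverse = bytearray(256)
    for src, dst in enumerate(table):
        inverse[dst] = src
    return bytes(inverse)


TO_DB = build_roundtrip_table("cp1252", "cp037")
TO_CLIENT = invert(TO_DB)

# The (TM) sign exists in cp1252 but not in cp037, so it lands on some
# otherwise unused database byte and comes back intact for the client.
stored = "Acme\u2122".encode("cp1252").translate(TO_DB)
restored = stored.translate(TO_CLIENT).decode("cp1252")
```

Only a client that applies the inverse table sees the (TM) sign again; anything that 
reads `stored` as plain cp037 gets "Acme" followed by an unrelated code.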

As users learn these days, problems come in when the data in the database codepage is 
used outside the closed system with the two co-dependent codepages.
Printing from the database, converting from the database codepage to codepages other 
than the originator/client one, and converting or migrating to Unicode can all be a 
nightmare, and may require first converting back to the originator codepage.
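
The same effect can be seen with a UTF-8B-style conversion. The sketch below uses 
Python's "surrogateescape" error handler, which follows the same idea (undecodable 
bytes are turned into lone surrogates U+DC80..U+DCFF); I am using it as a stand-in for 
UTF-8B rather than claiming it matches any UTF-8B proposal in every detail. The file 
name and the helper function are made up for illustration.

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes from a legacy file name; not valid UTF-8

# Inside the closed system: decode with the escape handler, then get the
# original bytes back by encoding with the same handler.
name = raw.decode("utf-8", errors="surrogateescape")
assert name == "caf\udce9.txt"
assert name.encode("utf-8", errors="surrogateescape") == raw

# Re-conversion based on the user's settings, done from the recovered bytes.
assert raw.decode("latin-1") == "caf\u00e9.txt"


def can_leave_closed_system(text):
    """Return True if text can be written out as ordinary, strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Outside the closed system the escaped code is not a real character.
assert can_leave_closed_system("plain.txt")
assert not can_leave_closed_system(name)
```

A consumer that does not know about the escape convention can neither display nor 
re-encode such a string; it has to go back to the original bytes first.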

It is certainly legitimate to solve certain problems by constructing such 
roundtrip-faithful mappings.
I don't think that the results of such mappings should be advertised as general-purpose 
encodings. They are just abuses of regular encodings, useful in closed systems and in 
particular circumstances.

markus
