At 10:32 AM 2/5/02 -0800, Magda Danish (Unicode) wrote:

>Begin forwarded message:
>
>From: [EMAIL PROTECTED]
>Date: 2002-02-05 10:44:20 -0800
>To: [EMAIL PROTECTED]
>Subject: Using Unicode Characters in ASCII Streams
>
>Hello,
>
>we are a manufacturer of time and attendance terminals which are 
>transferring data using 8-Bit character streams (ASCII + Latin 1 which is a 
>subset of Unicode: 0000 to 00FF). Because of the history and the future 
>compatibility it is not possible to change to 16-Bit characters used with 
>Unicode. But we want to use European Latin A (0100 to 0170) and Extended 
>Latin B (0180 to 01FF) expansions for east European countries. So it must 
>be possible to add any Unicode character to the 8-Bit streams. Now we are 
>looking for a standard method to do so.
>
>Now here is my question: Is there a method to add any Unicode character to 
>an 8-Bit ASCII stream?
>
>Example:  The Polish word "wyjs´cie" with character "Latin Small Letter s 
>with Acute" (015B) in the middle (s´ is one character)
>           We want to transfer something like this: "wyj\u015Bcie"; here 
> we use \u015B like coding in Java (our standard development language)
>
>           Now it is possible to decode the two characters "\u" and the 
> following four characters as one Unicode character. But we know this is 
> not a standard!

There are four options for forcing Unicode into an 8-bit format.

a) Use UTF-8. This preserves ASCII, but the characters >127 are encoded as 
multi-byte sequences, so their bytes are different from Latin-1.

b) Use Java or C style escapes, of the form \uXXXX or \xXXXX. This format 
is not standard for text files, but well defined in the framework of the 
languages in question, primarily for source files.

c) Use the &#xXXXX; or &#DDDDD; numeric character escapes as in HTML or XML.
Again, these are not standard for plain text files, but well defined within 
the framework of these markup languages.

d) Use SCSU (http://www.unicode.org/unicode/reports/tr6). This format 
compresses Unicode into 8-bit format, preserving most of ASCII, but using
some of the control codes as commands for the decoder.
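
As a rough illustration of options a) to c), here is a minimal Java sketch
that encodes the example word from the question in each of the three ways.
The class and method names are made up for this example, and the choice of
four uppercase hex digits is an assumption, not something any standard
prescribes for plain text. The sketch also does not escape a literal
backslash or ampersand in the input, which a real protocol would need to
settle. SCSU is left out because the Java standard library has no codec
for it.

```java
import java.nio.charset.StandardCharsets;

class EightBitEncodings {
    // Option a): UTF-8. ASCII bytes are unchanged; every other character
    // becomes a sequence of two or more bytes, all above 127.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Option b): keep 0000..00FF as is and write each higher UTF-16 code
    // unit as backslash, 'u' and four uppercase hex digits (assumed format).
    static String toJavaEscapes(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0xFF) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    // Option c): keep 0000..00FF as is and write each higher code point
    // as a hexadecimal numeric character reference.
    static String toNumericReferences(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp <= 0xFF) {
                out.append((char) cp);
            } else {
                out.append(String.format("&#x%04X;", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String word = "wyj\u015Bcie"; // the Polish example word, with 015B
        System.out.println(toUtf8(word).length);       // 8 bytes
        System.out.println(toJavaEscapes(word));       // 12 characters
        System.out.println(toNumericReferences(word)); // 14 characters
    }
}
```

The strings produced by the second and third methods contain only
characters up to 00FF, so they can be written to the existing 8-bit stream
with StandardCharsets.ISO_8859_1.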

Of these four approaches, d) uses the least space and a) is the most widely 
supported in plain text files. b) and c) use the most space, and each is
widely supported only *within* its own context: Java and C source files for
b), HTML and XML files for c).

All four require that the receiver can understand that format, but a) is 
considered one of the three equivalent Unicode Encoding Forms and therefore 
standard. The use of b) or c) outside of their given context would
definitely be considered non-standard, but could be a good solution for
internal data transmission. SCSU is itself a standard (for compressed data
streams), but few general purpose receivers support it, so it is again most
useful in internal data transmission.

If *all* characters above 255 are encoded with method b) or c) then the 
first three methods give unique representations. That is, if two strings in 
each of these formats match, then the corresponding Unicode character 
strings match as well. SCSU does *not* have this property.
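
For the receiving side of method b), a decoder could look roughly like the
following Java sketch. It assumes the stream is Latin-1 and that an escape
is exactly a backslash, the letter u and four hex digits; anything else is
passed through unchanged. The class and method names are invented for this
example.

```java
import java.nio.charset.StandardCharsets;

class EscapeDecoder {
    // Reads the bytes as Latin-1, then replaces each escape (backslash,
    // 'u', four hex digits) with the UTF-16 code unit it names.
    static String decode(byte[] stream) {
        String s = new String(stream, StandardCharsets.ISO_8859_1);
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.startsWith("\\u", i) && i + 6 <= s.length()
                    && isHex(s, i + 2, i + 6)) {
                int unit = Integer.parseInt(s.substring(i + 2, i + 6), 16);
                out.append((char) unit);
                i += 6;
            } else {
                // Not a complete escape: copy the character as is.
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True if every character in s[from, to) is a hexadecimal digit.
    static boolean isHex(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            if (Character.digit(s.charAt(i), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] wire = "wyj\\u015Bcie".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(wire).equals("wyj\u015Bcie")); // true
    }
}
```

Note that this decoder accepts both upper and lower case hex digits, so
comparing the encoded bytes directly is only reliable if the sender always
writes the escapes in one fixed spelling.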

A./
