UTF-7 signature

Markus Scherer Thu, 11 Apr 2002 11:01:40 -0700

On 2002-apr-09, Shlomi Tal and Doug Ewell discussed on this list a UTF-7 signature 
byte sequence of +/v8- (which was news to me).
(Subject "MS/Unix BOM FAQ again (small fix)")


I "meditated" some over this -

+/v8 is the encoding of U+FEFF as the first code point in a text. So far, so good.
The '-' as the next byte switches UTF-7 back to direct-encoding of a subset of 
US-ASCII.

What if there is no '-' there? What if a non-ASCII code point immediately follows the 
U+FEFF?
In such a case, depending on the following code point, the first four bytes could be
   +/v8  or  +/v9  or  +/v+  or  +/v/

The 4th byte carries the last four bits of U+FEFF plus the top two bits of the next 
16-bit code unit, so it will not be '8' if the following code point is >=U+4000.
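
A minimal sketch of this, in Python: the helper name utf7_shifted is made up for 
illustration, and it only builds a single base64 run by hand ('+' followed by the 
unpadded base64 of the UTF-16BE code units, as in RFC 2152), without the closing '-'. 
It is not a full UTF-7 encoder.

```python
import base64


def utf7_shifted(text: str) -> bytes:
    """Encode text as one UTF-7 base64 run (hypothetical helper).

    '+' followed by the unpadded base64 of the UTF-16BE code units;
    no closing '-' is appended.
    """
    units = text.encode("utf-16-be")
    return b"+" + base64.b64encode(units).rstrip(b"=")


# U+FEFF followed by code points whose top two bits are 00, 01, 10 and 11.
for follower in ("\u0100", "\u4e2d", "\u8000", "\uc000"):
    prefix = utf7_shifted("\ufeff" + follower)[:4]
    print(f"U+{ord(follower):04X} -> {prefix!r}")
```

The four prefixes printed are the four listed above; only a follower below U+4000 
leaves the 4th byte as '8'.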

This illustrates a property of UTF-7 that sets it further apart from most encodings 
than, for example, SCSU and BOCU-1 are:
In most Character Encoding Schemes, consecutive code units/points are encoded in 
_separate_, consecutive byte sequences.

In UTF-7, byte sequences overlap, and many bytes in the encoding (2 out of 8 I think) 
contain pieces of two adjacent code units.
This is more like Huffman codes.
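
The "2 out of 8" estimate can be checked with a little arithmetic. This is only a 
sketch with made-up names (straddling_sextets, UNIT_BITS, SEXTET_BITS): it assumes 
16-bit code units packed back to back into 6-bit base64 groups, and it lists the 
groups that contain a code unit boundary.

```python
UNIT_BITS = 16    # one UTF-16 code unit
SEXTET_BITS = 6   # payload bits carried by one base64 byte


def straddling_sextets(n_units: int) -> list[int]:
    """Return 0-based indexes of base64 bytes that span two code units."""
    total_bits = n_units * UNIT_BITS
    n_sextets = -(-total_bits // SEXTET_BITS)  # ceiling division
    boundaries = range(UNIT_BITS, total_bits, UNIT_BITS)
    return [
        i
        for i in range(n_sextets)
        if any(i * SEXTET_BITS < b < (i + 1) * SEXTET_BITS for b in boundaries)
    ]


# Three code units are 48 bits, which fill exactly 8 base64 bytes.
print(straddling_sextets(3))
```

For three code units, the 3rd and 6th of the 8 base64 bytes each hold bits of two 
adjacent code units, and the pattern then repeats.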

As one conclusion, one cannot always remove the initial encoding of U+FEFF from a UTF-7 
byte stream and start converting from the following byte offset. One must instead 
remove U+FEFF _from the output_.
This is also true for BOCU-1 because the initial U+FEFF is relevant for its state, 
although code points are encoded with non-overlapping byte sequences.
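
For the UTF-7 case, the difference can be sketched with Python's built-in "utf-7" 
codec. The sample bytes are an assumed input (U+FEFF immediately followed by U+4E2D 
in one base64 run), and the variable names are only illustrative.

```python
# U+FEFF U+4E2D encoded as a single UTF-7 base64 run.
data = b"+/v9OLQ-"

# Wrong: drop four "signature" bytes and decode from the next byte offset.
# The remaining bytes are then read as directly encoded ASCII.
wrong = data[4:].decode("utf-7")

# Right: decode the whole stream, then remove U+FEFF from the output.
text = data.decode("utf-7")
right = text[1:] if text.startswith("\ufeff") else text

print(repr(wrong), repr(right))
```

Skipping bytes loses the two bits of U+4E2D that live in the 4th byte and also drops 
the '+' shift, so the rest of the run is no longer decoded as base64 at all.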

For SCSU and all UTFs it is equally safe to skip the signature bytes before decoding 
or the initial U+FEFF after decoding.
(The SCSU signature is defined to not change the initial converter state; it is one of 
several SCSU encodings of U+FEFF.)

For as long as we keep using the/an encoding of U+FEFF as the signature for each 
Unicode encoding, it is possible to remove U+FEFF from the output when a signature was 
detected as such.

Sorry for rambling; back to work...

markus
