On 2002-apr-09, Shlomi Tal and Doug Ewell discussed on this list a UTF-7 signature byte sequence of +/v8- (which was news to me). (Subject "MS/Unix BOM FAQ again (small fix)")
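(A quick sanity check with Python's built-in UTF-7 codec, which closes the base64 run with '-' at end of input:)

```python
# A lone U+FEFF in UTF-7: '+' opens a base64 run, "/v8" is the
# modified-base64 encoding of the bytes FE FF, and '-' closes the run.
sig = "\ufeff".encode("utf_7")
print(sig)  # b'+/v8-'

# Round trip: decoding those five bytes gives back just U+FEFF.
print("\ufeff".encode("utf_7").decode("utf_7") == "\ufeff")  # True
```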
I "meditated" on this some. +/v8 is the encoding of U+FEFF as the first code point in a text. So far, so good. The '-' as the next byte switches UTF-7 back to direct encoding of a subset of US-ASCII. But what if there is no '-' there? What if a non-ASCII code point immediately follows the U+FEFF? In that case, depending on the following code point, the first four bytes could be +/v8 or +/v9 or +/v+ or +/v/. The 4th byte will not be '8' if the following code point is >= U+4000.

This illustrates a property of UTF-7 that sets it further apart from most encodings than, for example, SCSU and BOCU-1: in most Character Encoding Schemes, consecutive code units/points are encoded in _separate_, consecutive byte sequences. In UTF-7, byte sequences overlap, and many bytes in the encoding (2 out of every 8, I think) contain pieces of two adjacent code units. This is more like a Huffman code.

One conclusion: one cannot always remove the initial encoding of U+FEFF from a UTF-7 byte stream and start converting from the following byte offset. One must instead remove U+FEFF _from the output_. The same is true for BOCU-1, because the initial U+FEFF is relevant for its state, even though its code points are encoded with non-overlapping byte sequences. For SCSU and all UTFs it is equally safe to skip the signature bytes before decoding or to remove the initial U+FEFF after decoding. (The SCSU signature is defined not to change the initial converter state; it is one of several SCSU encodings of U+FEFF.)

For as long as we keep using the/an encoding of U+FEFF as the signature for each Unicode encoding, it remains possible to remove U+FEFF from the output when a signature was detected as such.

Sorry for rambling; back to work...

markus
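PS: A small Python sketch of the overlap, using the built-in utf_7 codec (the byte values follow from the UTF-7 definition; the helper name utf7_prefix is just mine for illustration):

```python
# The 4th byte of a UTF-7 stream that begins with U+FEFF depends on the
# top two bits of the *following* UTF-16 code unit, because one base64
# character can straddle two adjacent code units.
def utf7_prefix(next_cp):
    """First four bytes of UTF-7 for U+FEFF followed by next_cp."""
    return ("\ufeff" + chr(next_cp)).encode("utf_7")[:4]

print(utf7_prefix(0x00E9))  # b'+/v8'  top bits 00: next unit <  U+4000
print(utf7_prefix(0x4E00))  # b'+/v9'  top bits 01: U+4000..U+7FFF
print(utf7_prefix(0x8001))  # b'+/v+'  top bits 10: U+8000..U+BFFF
print(utf7_prefix(0xC000))  # b'+/v/'  top bits 11: U+C000 and up

# Consequence: one cannot blindly strip a 5-byte "+/v8-" signature,
# because the base64 run may continue past the BOM.  Decode first,
# then drop U+FEFF from the output:
data = "\ufeff\u4e00".encode("utf_7")  # starts b'+/v9', not b'+/v8-'
text = data.decode("utf_7")
if text.startswith("\ufeff"):
    text = text[1:]  # remove the signature from the *output*
print(text == "\u4e00")  # True
```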