While we are on the subject of ill-formed sequences, I was disappointed
to read the following in the Adobe PDF Reference, Fourth Edition, which
describes PDF version 1.5 and was published only ten weeks ago:

> Note: PDF does not prescribe what UTF-8 sequence to choose for
> representing any given piece of externally specified text as a name
> object. In some cases, there are multiple UTF-8 sequences that could
> represent the same logical text. Name objects defined by different
> sequences of bytes constitute distinct name objects in PDF, even
> though the UTF-8 sequences might have identical external
> interpretations.

I assume that by “multiple UTF-8 sequences that could represent the same
logical text,” Adobe is referring to non-shortest UTF-8 sequences such
as <C0 80> and not to Unicode canonical equivalences or something else.
No similar warning about “multiple sequences” is given in the sections
that deal with UTF-16.
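
To illustrate the point about <C0 80>, here is a minimal Python sketch. It assumes Python's built-in strict "utf-8" codec as the reference decoder; the helper name is_well_formed_utf8 is hypothetical and does not come from the PDF Reference or the Unicode Standard.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if the strict UTF-8 codec accepts these bytes."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The shortest (and only well-formed) UTF-8 encoding of U+0000 is one byte.
shortest = "\u0000".encode("utf-8")

# The two-byte non-shortest form mentioned above.
overlong = bytes([0xC0, 0x80])

print(is_well_formed_utf8(shortest))
print(is_well_formed_utf8(overlong))
```

A strict decoder treats the second sequence as an error rather than as a second spelling of the same character.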

Assuming that, this only serves to perpetuate the myth that non-shortest
UTF-8 sequences are permitted in Unicode.  One can cite the “tightening”
of the definition of UTF-8 that occurred with Unicode 3.1 and 3.2 as a
policy change, but the fact is that encoders have *never* been allowed
to generate non-shortest sequences.

Earlier conformance requirements that allowed decoders to interpret
non-shortest forms were intended only to save a few CPU cycles for
mid-’90s processors, not to give encoders free rein to generate what we
now think of as ill-formed UTF-8 text.  And in fact, the likelihood is
that very little such text exists in the real world.

Even the original “FSS-UTF” definition by Ken Thompson, which was
written ten YEARS ago, made this clear:

> When there are multiple ways to encode a value, for example UCS 0,
> only the shortest encoding is legal.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

