While we are on the subject of ill-formed sequences, I was disappointed to read the following in the Adobe PDF Reference, Fourth Edition, which describes PDF version 1.5 and was published only ten weeks ago:
> Note: PDF does not prescribe what UTF-8 sequence to choose for
> representing any given piece of externally specified text as a name
> object. In some cases, there are multiple UTF-8 sequences that could
> represent the same logical text. Name objects defined by different
> sequences of bytes constitute distinct name objects in PDF, even
> though the UTF-8 sequences might have identical external
> interpretations.

I assume that by "multiple UTF-8 sequences that could represent the same logical text," Adobe is referring to non-shortest UTF-8 sequences such as <C0 80>, and not to Unicode canonical equivalences or something else. (No similar warning about "multiple sequences" is given in the sections that deal with UTF-16.)

Assuming that, this only serves to perpetuate the myth that non-shortest UTF-8 sequences are permitted in Unicode. One can cite the "tightening" of the definition of UTF-8 that occurred with Unicode 3.1 and 3.2 as a policy change, but the fact is that encoders have *never* been allowed to generate non-shortest sequences. Earlier conformance requirements that allowed decoders to interpret non-shortest forms were intended only to save a few CPU cycles for mid-'90s processors, not to give encoders free rein to generate what we now think of as ill-formed UTF-8 text. And in fact, the likelihood is that very little such text exists in the real world.

Even the original "FSS-UTF" definition by Ken Thompson, which was written ten YEARS ago, made this clear:

> When there are multiple ways to encode a value, for example UCS 0,
> only the shortest encoding is legal.
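
As a purely illustrative aside, here is a minimal sketch (assuming Python 3, whose strict UTF-8 codec enforces the shortest-form rule) of how a conformant decoder is expected to treat the overlong sequence <C0 80>:

    # Minimal sketch, assuming Python 3; not taken from the PDF Reference.
    # The only legal UTF-8 encoding of U+0000 is the single byte 00; the
    # two-byte form <C0 80> is a non-shortest ("overlong") sequence and a
    # conformant decoder must reject it.

    assert "\u0000".encode("utf-8") == b"\x00"    # shortest form: one byte

    try:
        b"\xC0\x80".decode("utf-8")               # overlong form of U+0000
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)            # e.g. "invalid start byte"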

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/