Greenwood, Timothy wrote:
>This question is pertinent to one asked me the other day for which I did not have an
>answer. Is the code set of an original document relevant for PDF - say EUC, SJIS, PDF
>- will the output perform text searches correctly for differing code set inputs?
>

PDF documents logically contain two streams: one of characters, and one of glyphs.
The glyph stream is always present physically, and is used for rendering. Depending on the fonts involved, the PDF generator, and all sorts of other factors, the meaning of the numbers in that glyph stream, and the machinery to locate the actual outlines, will vary quite a bit.

The character stream can be represented explicitly, in which case I am pretty sure it is always a Unicode stream. Alternatively, it can be computed from the glyph stream using various mechanisms; I believe that all the computations described in the PDF spec generate a Unicode stream. The choice of explicit vs. implicit character representation is up to the PDF producer.

In all cases, I believe that the producer has the responsibility of converting from whatever character standard is used in the original document to Unicode. When the producer is Distiller, it may not have access to the original character content and may be forced to create an approximation.

Eric.
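
To illustrate the points above, here is a minimal Python sketch, not anything taken from the PDF spec or from Distiller. It assumes the producer decodes the source bytes to Unicode, and it models the implicit case with a made-up glyph-code table; the function names, glyph numbers, and sample word are all hypothetical.

```python
# Illustrative sketch only: function names, glyph codes and the table
# below are hypothetical, not part of any PDF library or of the PDF spec.

def producer_to_unicode(raw: bytes, source_encoding: str) -> str:
    """Convert source-document bytes to Unicode, as a producer is expected to."""
    return raw.decode(source_encoding)


def glyphs_to_text(glyph_codes, glyph_to_unicode):
    """Recover a character stream from glyph codes (the implicit case)."""
    # Glyphs with no known mapping become U+FFFD, i.e. an approximation.
    return "".join(glyph_to_unicode.get(code, "\ufffd") for code in glyph_codes)


word = "日本語"

# The same text arrives in two different code sets...
from_euc = producer_to_unicode(word.encode("euc_jp"), "euc_jp")
from_sjis = producer_to_unicode(word.encode("shift_jis"), "shift_jis")

# ...and a made-up font maps arbitrary glyph numbers back to Unicode.
glyph_to_unicode = {17: "日", 42: "本", 99: "語"}
recovered = glyphs_to_text([17, 42, 99], glyph_to_unicode)
partial = glyphs_to_text([17, 7, 99], glyph_to_unicode)  # glyph 7 is unmapped
```

Under these assumptions, `from_euc` and `from_sjis` are the same Unicode string even though the input bytes differ, so a search for the word would behave the same for either input. `partial` shows what an approximation looks like when a glyph cannot be mapped back to a character.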

