Re: Joined "ti" coded as "O" in PDF

Steve Swales Fri, 06 May 2016 08:52:36 -0700

This discussion seems to have fizzled out, but I’m concerned that there’s a 
real-world problem here which is at least partially the concern of the 
Consortium, so let me stir the pot and see if there’s still any meat left.


On the current release of MacOS (including the developer beta, for your 
reference, Peter), if you use the Calibri font, for example, in any app (e.g. 
Notes) to write words containing “ti” (like “internationalization”), then press 
“Print” and “Open PDF in Preview”, you get a PDF document with the joined “ti”. 
Subsequently cutting and pasting produces mojibake, and searching the document 
for words with “ti” doesn’t work, as previously noted.

I suppose we can look on this as purely a font handling/MacOS bug, but I’m 
wondering if we should be providing accommodations or conveniences in Unicode 
for it to work as desired.
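
To illustrate the search side of this, here is a minimal Python sketch. It 
assumes, purely for illustration, that the text extracted from the PDF carries 
a private-use code point (U+E000) wherever the joined “ti” glyph appeared; the 
actual value Preview emits may differ. The helper name fold_for_search is 
hypothetical. The “fi” ligature has its own code point (U+FB01) with a 
compatibility decomposition, so NFKC normalization can fold it back to plain 
letters; there is no comparable code point for “ti”, so normalization cannot 
help there.

```python
import unicodedata


def fold_for_search(text: str) -> str:
    """Fold compatibility characters (such as U+FB01) to plain letters via NFKC."""
    return unicodedata.normalize("NFKC", text)


# "define" written with U+FB01 LATIN SMALL LIGATURE FI: recoverable for search.
extracted_fi = "de\uFB01ne"
print("fi" in extracted_fi)                   # raw text does not contain "fi"
print("fi" in fold_for_search(extracted_fi))  # folded text does

# Assumed stand-in for a joined "ti" glyph with no usable Unicode mapping.
# U+E000 is a private-use code point chosen only for this illustration.
extracted_ti = "interna\uE000onaliza\uE000on"
print("ti" in fold_for_search(extracted_ti))  # normalization does not recover "ti"
```

In other words, a search tool can compensate for ligatures that Unicode 
encodes, but for a font-specific “ti” glyph the correct text has to come from 
the mapping the PDF producer writes into the file.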

-steve



> On Mar 21, 2016, at 1:40 AM, Philippe Verdy <[email protected]> wrote:
> 
> Are those PDFs supposed to be searchable internally? For archival 
> purposes, the PDFs are stored in their final form, and search is performed by 
> creating a database of descriptive metadata. Each time one wants formal 
> details, they have to read the original the way it was presented (many PDFs 
> are just scanned facsimiles of old documents which originally were not even 
> in digital plain text; they were printed or typewritten, and frequently they 
> include graphics, handwritten signatures, stamped seals...)
> 
> Being able to search plain text inside a PDF is not the main objective (and 
> not the priority). Archival, however, is a top priority (and there's no 
> money to finance digitization and no human resources available to redo this 
> old work; if needed, other contributors will recreate a plain-text version, 
> possibly with rich-text features, e.g. in Wikisource for old documents that 
> fall in the public domain).
> 
> PDF/A-1a is meant only for creating new documents from an original plain-text 
> or rich-text document created with modern word-processing applications. But 
> this specification will frequently have to be broken, if there's the need to 
> include handwritten or supplementary elements (signatures, seals...) whose 
> source is not the original electronic document but the printed paper over 
> which the annotations were made: it is this paper document, not the 
> electronic document which is the official final source (we've got some 
> important legal paper whose original has other marks including traces of beer 
> or coffee, or partly burnt, the paper itself has several alterations, but it 
> is the original "as is", and for legal purpose the only acceptable archival 
> form as a PDF must ignore all the PDF/A-1a constraints, not meant to 
> represent originals accurately).
> 
> 2016-03-20 20:52 GMT+01:00 Tom Gewecke <[email protected]>:
> 
> > On Mar 20, 2016, at 12:24 PM, Asmus Freytag (t) <[email protected]> wrote:
> >
> > Usually, the archive feature pertains only to the fact that you can 
> > reproduce the final form, not to being able to get at the correct source 
> > (plain text backbone) for the document.
> 
> My understanding is that PDF/A-1a is supposed to be searchable.
> 
> 
> 
>
