Re: [XeTeX] how to do (better) searchable PDFs in xelatex?
Hi Peter, Jonathan,

On 16/10/2012, at 2:02, Peter Baker wrote:
> On 10/15/12 10:59 AM, Jonathan Kew wrote:
>> That's exactly the problem - these glyphs are encoded at PUA codepoints, so
>> that's what (most) tools will give you as the corresponding character data.
>> If they were unencoded, (some) tools would use the glyph names to infer the
>> relevant characters, which would work better.
>>
>>> Small caps are named like "a.sc" and they are unencoded.
>>
>> And as they're unencoded, (some) tools will look at the glyph name and map
>> it to the appropriate character.
>
> I've been trying to explain this: but Jonathan does it much better than I
> did, and with more authority.

Yes, but why would the tools be designed this way? Surely "unencoded" means that the code-point has not been assigned yet, and may be assigned in future, so using these is asking for trouble. Was not the intention of the PUA to be the place to put characters that you need now but that have no corresponding Unicode point? This is precisely where using the glyph name should work. Or am I missing something?

So why would a tool be designed to infer the right composition of characters when a ligature is properly named at an unencoded point, but not apply that same algorithm when the ligature is at a PUA point?

> P.

Perplexed,

Ross

PS. Wouldn't this particular issue with ligatures be resolved with a /ToUnicode CMap for the font, which can do one-to-many assignments? Yes, this does not handle the many-to-one and many-to-many requirements of complex scripts, but that isn't what was being reported here, and it is a much harder recognition problem. Besides, it isn't clear what copy-paste should best produce in those cases, nor how to specify the desired search.

--
Subscriptions, Archive, and List information, etc.:
http://tug.org/mailman/listinfo/xetex
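[To make Ross's /ToUnicode suggestion concrete, here is a minimal sketch of the bfchar section such a CMap would contain, using the Junicode codes and glyph names from Peter's listing later in the thread. The CMap name is made up for illustration, and in an actual XeTeX-produced PDF the source codes would normally be the font's glyph IDs rather than the PUA values; the surrounding boilerplate is the standard CMap wrapper.]

```
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /Junicode-ToUnicode def   % hypothetical name
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
% one glyph code mapped to a *sequence* of Unicode characters:
<EECA> <00660072>       % f_r   -> "fr"  (U+0066 U+0072)
<EECB> <00660074>       % f_t   -> "ft"
<EED0> <006600740079>   % f_t_y -> "fty"
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
```

Each bfchar entry pairs one source code with an arbitrary UTF-16BE string, which is exactly the one-to-many direction Ross describes; there is no entry form that consumes more than one source code at a time.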
Re: [XeTeX] how to do (better) searchable PDFs in xelatex?
On 10/15/12 10:59 AM, Jonathan Kew wrote:
> That's exactly the problem - these glyphs are encoded at PUA codepoints, so
> that's what (most) tools will give you as the corresponding character data.
> If they were unencoded, (some) tools would use the glyph names to infer the
> relevant characters, which would work better.
>
>> Small caps are named like "a.sc" and they are unencoded.
>
> And as they're unencoded, (some) tools will look at the glyph name and map
> it to the appropriate character.

I've been trying to explain this: but Jonathan does it much better than I did, and with more authority.

P.
Re: [XeTeX] how to do (better) searchable PDFs in xelatex?
On 15/10/12 15:19, Peter Baker wrote:
> Here's an example file:
>
> %&program=xelatex
> %&encoding=UTF-8 Unicode
> \documentclass{book}
> \usepackage[silent]{fontspec}
> \usepackage{xltxtra}
> \setromanfont{Junicode}
> \begin{document}
> \noindent You can search for these:
>
> \noindent first flat office afflict\\
> \noindent But you cannot search for these:
>
> \noindent after fifty front\\
> \noindent You can search for these words because small caps have been
> moved out of the PUA in recent versions of Junicode:
>
> \noindent\textsc{first flat office afflict after fifty front}
> \end{document}
>
> Here's a link to an uncompressed (using pdftk) PDF:
> https://dl.dropbox.com/u/35611549/test_uncompressed.pdf
>
> I honestly have no idea what I'm looking at when I open that in Emacs.
>
> Here is info about the Junicode ligatures that can't be searched:
>
> glyph name f_t, encoding U+EECB
> glyph name f_t_y, encoding U+EED0
> glyph name f_r, encoding U+EECA

That's exactly the problem - these glyphs are encoded at PUA codepoints, so that's what (most) tools will give you as the corresponding character data. If they were unencoded, (some) tools would use the glyph names to infer the relevant characters, which would work better.

> Small caps are named like "a.sc" and they are unencoded.

And as they're unencoded, (some) tools will look at the glyph name and map it to the appropriate character.

> The font is generated by FontForge. The PDF is generated by XeTeX (XeLaTeX
> actually). I don't know if another program (e.g. LuaTeX) would yield
> different results.
>
> Peter
>
> On 10/14/12 10:56 PM, Ross Moore wrote:
>> Any chance of providing example PDFs of this? (preferably using
>> uncompressed streams, to more easily examine the raw PDF content) Do
>> the documents also have CMap resources for the fonts, or is the sole
>> means of identifying the meaning of the ligature characters coming
>> from their names only? Have these difficulties been reported to Adobe
>> recently? If not, would you mind me doing so?
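[The PUA-versus-unencoded distinction Jonathan draws can be checked mechanically. A small sketch, assuming you already have the font's character map as a dict of codepoint -> glyph name; with fontTools installed, `TTFont("Junicode.ttf").getBestCmap()` returns such a dict, but the sample data below is simply taken from Peter's listing so the sketch runs stand-alone.]

```python
# Codepoints U+E000..U+F8FF form the Private Use Area of the Basic
# Multilingual Plane; glyphs mapped there carry no standard meaning,
# so text extraction hands back the raw PUA codepoint.
PUA_START, PUA_END = 0xE000, 0xF8FF

def pua_glyphs(cmap):
    """Return the {codepoint: glyph name} entries that sit in the PUA."""
    return {cp: name for cp, name in cmap.items()
            if PUA_START <= cp <= PUA_END}

# Sample data from the Junicode listing in this thread.
sample_cmap = {
    0x0061: "a",      # ordinary encoded letter, searchable
    0xEECA: "f_r",    # PUA-encoded ligatures, extracted as PUA codes
    0xEECB: "f_t",
    0xEED0: "f_t_y",
}

for cp, name in sorted(pua_glyphs(sample_cmap).items()):
    print(f"U+{cp:04X} {name}")
```

An unencoded glyph would simply not appear in this cmap at all, which is what forces tools back onto the glyph name.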
Re: [XeTeX] how to do (better) searchable PDFs in xelatex?
Here's an example file:

%&program=xelatex
%&encoding=UTF-8 Unicode
\documentclass{book}
\usepackage[silent]{fontspec}
\usepackage{xltxtra}
\setromanfont{Junicode}
\begin{document}
\noindent You can search for these:

\noindent first flat office afflict\\
\noindent But you cannot search for these:

\noindent after fifty front\\
\noindent You can search for these words because small caps have been
moved out of the PUA in recent versions of Junicode:

\noindent\textsc{first flat office afflict after fifty front}
\end{document}

Here's a link to an uncompressed (using pdftk) PDF:
https://dl.dropbox.com/u/35611549/test_uncompressed.pdf

I honestly have no idea what I'm looking at when I open that in Emacs.

Here is info about the Junicode ligatures that can't be searched:

glyph name f_t, encoding U+EECB
glyph name f_t_y, encoding U+EED0
glyph name f_r, encoding U+EECA

Small caps are named like "a.sc" and they are unencoded.

The font is generated by FontForge. The PDF is generated by XeTeX (XeLaTeX actually). I don't know if another program (e.g. LuaTeX) would yield different results.

Peter

On 10/14/12 10:56 PM, Ross Moore wrote:
> Any chance of providing example PDFs of this? (preferably using
> uncompressed streams, to more easily examine the raw PDF content) Do
> the documents also have CMap resources for the fonts, or is the sole
> means of identifying the meaning of the ligature characters coming
> from their names only? Have these difficulties been reported to Adobe
> recently? If not, would you mind me doing so?
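[Glyph names like f_t and a.sc above follow Adobe's glyph-naming conventions, which is what lets tools infer text from names: strip any suffix after the first period, split the remainder on underscores, and map each component to its character. A rough sketch of that inference; the single-letter shortcut stands in for a lookup in the full Adobe Glyph List, which a real tool would consult.]

```python
def infer_text(glyph_name):
    """Infer the character string a glyph name stands for, following
    Adobe glyph-naming conventions (simplified sketch)."""
    # Anything after the first period is a stylistic suffix: a.sc -> a
    base = glyph_name.split(".", 1)[0]
    chars = []
    # Underscores join ligature components: f_t_y -> f, t, y
    for comp in base.split("_"):
        if comp.startswith("uni") and len(comp) == 7:
            chars.append(chr(int(comp[3:], 16)))  # e.g. uni0041 -> "A"
        elif len(comp) == 1:
            # stand-in for a full Adobe Glyph List lookup
            chars.append(comp)
        else:
            return None  # unknown component; a real tool consults the AGL
    return "".join(chars)

print(infer_text("f_t"))    # ft
print(infer_text("f_t_y"))  # fty
print(infer_text("a.sc"))   # a
```

This only helps when the glyph is unencoded; once the glyph has a PUA cmap entry, (most) tools report the PUA codepoint instead of running this inference, which is exactly the problem described above.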
Re: [XeTeX] how to do (better) searchable PDFs in xelatex?
2012/10/15 Mojca Miklavec :
> On Mon, Oct 15, 2012 at 12:04 AM, Andrew Cunningham wrote:
>> This is the nature of the PDF format. It is a preprint format that
>> focuses on glyphs rather than characters.
>>
>> It partly depends on the font, and the OT features being used.
>>
>> In theory you can have ActualText in the PDF, but once you move to complex
>> scripts all bets are off. Without a complete rewrite of the PDF standard,
>> fidelity to the text is not really possible. The PDF format wasn't designed
>> to do it.
>
> I might be wrong, but pdfTeX-generated documents work fine (after
> adding an encoding vector) even though the glyphs populate "random" slots
> in the font (for example T1 encoding) that have nothing to do with
> Unicode.

It works with good fonts in good viewers because these "good fonts" assign proper names to the glyphs. I tested this many years ago not only in pdftex but also with tex + dvips + either ps2pdf from GS or Adobe Distiller.

> It should be possible to do something similar in XeTeX/LuaTeX.
>
> I'm not saying that this would solve problems of copy-pasting Arabic
> scripts, but it should be possible to cover alternate glyphs for Latin
> scripts at least.
>
> Mojca
>
> PS: From
> http://blogs.adobe.com/insidepdf/2008/07/text_content_in_pdf_files.html
>
> There is an optional auxiliary structure called the "ToUnicode" table
> that was introduced into PDF to help with this text retrieval problem.
> A ToUnicode table can be associated with a font that does not normally
> have a way to determine the relationship between glyphs and Unicode
> characters (some do). The table maps strings of glyph identifiers into
> strings of Unicode characters, often just one to one, so that the
> proper character strings can be made from the glyph references in the
> file.

ToUnicode can only replace a single code with a sequence of characters. A Type1 font can encode only 256 characters, therefore such a mapping is possible.
Many years ago I developed a ToUnicode map for Velthuis Devanagari:
http://icebearsoft.euweb.cz/dvngpdf/

Complex scripts would require many-to-many mapping, but that is impossible with ToUnicode.

--
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
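[The limitation Zdeněk describes can be modelled in a few lines: text extraction walks the content stream code by code and substitutes each code independently through the ToUnicode map, so one code may expand to several characters, but several codes can never be remapped jointly. The codes below are illustrative, reusing the 0xEECB slot for f_t from earlier in the thread.]

```python
def extract(codes, to_unicode):
    """Toy model of ToUnicode-based text extraction: each glyph code is
    replaced independently by its mapped character string."""
    return "".join(to_unicode.get(code, chr(code)) for code in codes)

# One-to-many works: the single f_t ligature code expands to "ft",
# so "after" typeset as the codes a, f_t, e, r extracts correctly.
to_unicode = {0xEECB: "ft"}
codes_for_after = [ord("a"), 0xEECB, ord("e"), ord("r")]
print(extract(codes_for_after, to_unicode))  # after

# Many-to-many is inexpressible: a complex-script shaper may split and
# reorder glyphs, and no per-code table can map such a *sequence* back
# to the original character string jointly.
```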
Re: [XeTeX] how to do (better) searchable PDFs in xelatex?
On Mon, Oct 15, 2012 at 10:32 AM, Joe Corneli wrote:
> ... but fi => fi, despite the latter copy-pasting as "fi". Somehow this
> does seem like it's just an oversight on the part of the font developers.

It can also be an oversight on the part of PDF viewer developers. Apple decomposes all accented Latin characters, for example ("C" followed by "combining caron" instead of just "Č"). I always found that horribly annoying. On the other hand, it had zero problems with infinity, other math symbols and Greek letters from pdfTeX-generated documents, so I usually had no problems copy-pasting mathematical formulas. I only had to add an extra pair of dollars to get a nicely formatted formula.

You should try "pdftotext", Adobe Acrobat, Apple's Preview (if you have access to it), some free viewers, ... and the results are often different.

In my opinion, any decent PDF viewer should be able to convert the "fi" ligature into two separate letters when copy-pasting. This cannot be the font designer's fault.

Mojca

PS: It is 2012 ... and as of a couple of months ago, Opera (the web browser) still fails to display the most basic accented Latin characters (like š & ž) even when the page encoding is properly set. Yes, it's unbelievable.
Re: [XeTeX] how to do (better) searchable PDFs in xelatex?
On Mon, Oct 15, 2012 at 9:13 AM, Mojca Miklavec wrote:
> I might be wrong, but pdfTeX-generated documents work fine (after
> adding an encoding vector) even though the glyphs populate "random" slots
> in the font (for example T1 encoding) that have nothing to do with
> Unicode.
>
> It should be possible to do something similar in XeTeX/LuaTeX.
>
> I'm not saying that this would solve problems of copy-pasting Arabic
> scripts, but it should be possible to cover alternate glyphs for Latin
> scripts at least.

Sounds like a possible solution to my problem... and indeed, this MWE is searchable --

\documentclass{book}
\usepackage{libertine}
\begin{document}
Quantitative/Prefix.
\end{document}

... without the Qu => ligature, but fi => fi, despite the latter copy-pasting as "fi". Somehow this does seem like it's just an oversight on the part of the font developers.

Indeed, it seems this is already noted in their bug tracker:
http://sourceforge.net/tracker/index.php?func=detail&aid=3575137&group_id=89513&atid=590374

I guess there's little else to do but wait until that's fixed.
Re: [XeTeX] how to do (better) searchable PDFs in xelatex?
On Mon, Oct 15, 2012 at 12:04 AM, Andrew Cunningham wrote:
> This is the nature of the PDF format. It is a preprint format that
> focuses on glyphs rather than characters.
>
> It partly depends on the font, and the OT features being used.
>
> In theory you can have ActualText in the PDF, but once you move to complex
> scripts all bets are off. Without a complete rewrite of the PDF standard,
> fidelity to the text is not really possible. The PDF format wasn't designed
> to do it.

I might be wrong, but pdfTeX-generated documents work fine (after adding an encoding vector) even though the glyphs populate "random" slots in the font (for example T1 encoding) that have nothing to do with Unicode.

It should be possible to do something similar in XeTeX/LuaTeX.

I'm not saying that this would solve problems of copy-pasting Arabic scripts, but it should be possible to cover alternate glyphs for Latin scripts at least.

Mojca

PS: From
http://blogs.adobe.com/insidepdf/2008/07/text_content_in_pdf_files.html

There is an optional auxiliary structure called the "ToUnicode" table that was introduced into PDF to help with this text retrieval problem. A ToUnicode table can be associated with a font that does not normally have a way to determine the relationship between glyphs and Unicode characters (some do). The table maps strings of glyph identifiers into strings of Unicode characters, often just one to one, so that the proper character strings can be made from the glyph references in the file.