Re: [XeTeX] how to do (better) searchable PDFs in xelatex?

2012-10-15 Thread Ross Moore
Hi Peter, Jonathan,

On 16/10/2012, at 2:02, Peter Baker  wrote:

> On 10/15/12 10:59 AM, Jonathan Kew wrote:
>> 
>> That's exactly the problem - these glyphs are encoded at PUA codepoints, so 
>> that's what (most) tools will give you as the corresponding character data. 
>> If they were unencoded, (some) tools would use the glyph names to infer the 
>> relevant characters, which would work better.
>> 
>>> Small caps are named like "a.sc" and they are unencoded.
>> And as they're unencoded, (some) tools will look at the glyph name and map 
>> it to the appropriate character.
> 
> I've been trying to explain this:  but Jonathan does it much better than I 
> did, and with more authority.

Yes, but why would the tools be designed this way?
Surely "unencoded" means that the code-point has not been assigned yet, and may 
be assigned in future. So relying on these is asking for trouble.
Was not the intention of the PUA to be the place to put characters that you need 
now, but which have no corresponding Unicode point? This is precisely where using 
the glyph name should work. Or am I missing something?

So why would the tool be designed to infer the right composition of characters 
when a ligature is properly named at an unencoded point, but that same 
algorithm is not used when it is at a PUA point?

> 
> P.

Perplexed.

Ross

PS. Would not this particular issue with ligatures be resolved with a 
/ToUnicode CMap for the font, which can do one-to-many assignments? 
Yes, this does not handle the many-to-one and many-to-many requirements of complex 
scripts, but that isn't what was being reported here, and those pose a much harder 
recognition problem.
Besides, it isn't clear there what copy-paste should best produce, nor how to 
specify the desired search.
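(For reference, a one-to-many assignment in a /ToUnicode CMap looks roughly like this. This is a sketch only: the glyph codes are the Junicode PUA slots reported earlier in the thread, and it assumes the PDF writes those codes as the character codes in its text strings.)

```
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /Junicode-ToUnicode def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<EECA> <00660072>       % f_r   -> "fr"  (U+0066 U+0072)
<EECB> <00660074>       % f_t   -> "ft"
<EED0> <006600740079>   % f_t_y -> "fty"
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end end
```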


--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] how to do (better) searchable PDFs in xelatex?

2012-10-15 Thread Peter Baker

On 10/15/12 10:59 AM, Jonathan Kew wrote:


That's exactly the problem - these glyphs are encoded at PUA 
codepoints, so that's what (most) tools will give you as the 
corresponding character data. If they were unencoded, (some) tools 
would use the glyph names to infer the relevant characters, which 
would work better.



Small caps are named like "a.sc" and they are unencoded.
And as they're unencoded, (some) tools will look at the glyph name and 
map it to the appropriate character.


I've been trying to explain this:  but Jonathan does it much better than 
I did, and with more authority.


P.





Re: [XeTeX] how to do (better) searchable PDFs in xelatex?

2012-10-15 Thread Jonathan Kew

On 15/10/12 15:19, Peter Baker wrote:

Here's an example file:

%&program=xelatex
%&encoding=UTF-8 Unicode
\documentclass{book}
\usepackage[silent]{fontspec}
\usepackage{xltxtra}
\setromanfont{Junicode}
\begin{document}
\noindent You can search for these:

\noindent first flat office afflict\\

\noindent But you cannot search for these:

\noindent after fifty front\\

\noindent You can search for these words because small caps have been
moved out
of the PUA in recent versions of Junicode:

\noindent\textsc{first flat office afflict after fifty front}
\end{document}

Here's a link to an uncompressed (using pdftk) PDF:

https://dl.dropbox.com/u/35611549/test_uncompressed.pdf

I honestly have no idea what I'm looking at when I open that in Emacs.
Here is info about the Junicode ligatures that can't be searched:

glyph name f_t, encoding U+EECB
glyph name f_t_y, encoding U+EED0
glyph name f_r, encoding U+EECA

That's exactly the problem - these glyphs are encoded at PUA codepoints, 
so that's what (most) tools will give you as the corresponding character 
data. If they were unencoded, (some) tools would use the glyph names to 
infer the relevant characters, which would work better.



Small caps are named like "a.sc" and they are unencoded.
And as they're unencoded, (some) tools will look at the glyph name and 
map it to the appropriate character.
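(As a rough sketch of the convention such tools follow: per the Adobe Glyph List naming rules, underscores in a glyph name join ligature components, and a dot introduces a variant suffix that is stripped. A minimal illustration in Python; the function name is mine, and real AGL parsing also handles uniXXXX-style components, which this skips.)

```python
def glyph_name_to_text(name: str) -> str:
    """Infer character data from a glyph name, AGL-style:
    strip any ".suffix" variant marker (e.g. "a.sc" -> "a"),
    then join ligature components split on "_" (e.g. "f_t" -> "ft")."""
    base = name.split(".", 1)[0]      # drop variant suffix like ".sc"
    return "".join(base.split("_"))   # join ligature components

# The glyphs discussed in this thread:
print(glyph_name_to_text("a.sc"))    # -> a
print(glyph_name_to_text("f_t"))     # -> ft
print(glyph_name_to_text("f_t_y"))   # -> fty
```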



The font is
generated by FontForge. The PDF is generated by XeTeX (XeLaTeX
actually). I don't know if another program (e.g. LuaTeX) would yield
different results.

Peter

On 10/14/12 10:56 PM, Ross Moore wrote:
> Any chance of providing example PDFs of this? (preferably using
> uncompressed streams, to more easily examine the raw PDF content) Do
> the documents also have CMap resources for the fonts, or is the sole
> means of identifying the meaning of the ligature characters coming
> from their names only? Have these difficulties been reported to Adobe
> recently? If not, would you mind me doing so?




Re: [XeTeX] how to do (better) searchable PDFs in xelatex?

2012-10-15 Thread Peter Baker

Here's an example file:

%&program=xelatex
%&encoding=UTF-8 Unicode
\documentclass{book}
\usepackage[silent]{fontspec}
\usepackage{xltxtra}
\setromanfont{Junicode}
\begin{document}
\noindent You can search for these:

\noindent first flat office afflict\\

\noindent But you cannot search for these:

\noindent after fifty front\\

\noindent You can search for these words because small caps have been 
moved out of the PUA in recent versions of Junicode:

\noindent\textsc{first flat office afflict after fifty front}
\end{document}

Here's a link to an uncompressed (using pdftk) PDF:

https://dl.dropbox.com/u/35611549/test_uncompressed.pdf

I honestly have no idea what I'm looking at when I open that in Emacs. 
Here is info about the Junicode ligatures that can't be searched:


glyph name f_t, encoding U+EECB
glyph name f_t_y, encoding U+EED0
glyph name f_r, encoding U+EECA

Small caps are named like "a.sc" and they are unencoded. The font is 
generated by FontForge. The PDF is generated by XeTeX (XeLaTeX 
actually). I don't know if another program (e.g. LuaTeX) would yield 
different results.
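(One quick way to inspect which PUA slots a font assigns to its ligature glyphs is fontTools; a sketch under assumptions: the "Junicode.ttf" path is hypothetical, and fontTools must be installed.)

```python
def pua_glyphs(cmap):
    """Given a codepoint -> glyph-name mapping, return only the
    entries that sit in the BMP Private Use Area (U+E000..U+F8FF)."""
    return {cp: name for cp, name in cmap.items()
            if 0xE000 <= cp <= 0xF8FF}

if __name__ == "__main__":
    from fontTools.ttLib import TTFont   # pip install fonttools
    font = TTFont("Junicode.ttf")        # hypothetical local path
    for cp, name in sorted(pua_glyphs(font.getBestCmap()).items()):
        print(f"U+{cp:04X}  {name}")     # e.g. U+EECB  f_t
```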


Peter

On 10/14/12 10:56 PM, Ross Moore wrote:
Any chance of providing example PDFs of this? (preferably using 
uncompressed streams, to more easily examine the raw PDF content) Do 
the documents also have CMap resources for the fonts, or is the sole 
means of identifying the meaning of the ligature characters coming 
from their names only? Have these difficulties been reported to Adobe 
recently? If not, would you mind me doing so? 





Re: [XeTeX] how to do (better) searchable PDFs in xelatex?

2012-10-15 Thread Zdenek Wagner
2012/10/15 Mojca Miklavec :
> On Mon, Oct 15, 2012 at 12:04 AM, Andrew Cunningham wrote:
>>
> This is the nature of the PDF format. It is a print format that focuses on
> glyphs rather than characters.
>>
>> It partly depends on the font, and the OT features being used.
>>
>> In theory you can have ActualText in the PDF, but once you move to complex
>> scripts all bets are off. Without a complete rewrite of the PDF standard
>>  fidelity to the text is not really possible. PDF format wasn't designed
>> to do it.
>
> I might be wrong, but pdfTeX-generated documents work fine (after
> adding an encoding vector) even though the glyphs populate "random" slots
> in the font (for example T1 encoding) that have nothing to do with
> Unicode.
>
It works with good fonts in good viewers because these "good fonts"
assign proper names to the glyphs. I tested this many years ago not
only in pdftex but also with tex + dvips + either ps2pdf from GS or
Adobe Distiller.

> It should be possible to do something similar in XeTeX/LuaTeX.
>
> I'm not saying that this would solve problems of copy-pasting Arabic
> scripts, but it should be possible to cover alternate glyphs for Latin
> scripts at least.
>
> Mojca
>
> PS: From 
> http://blogs.adobe.com/insidepdf/2008/07/text_content_in_pdf_files.html
>
> There is an optional auxiliary structure called the "ToUnicode" table
> that was introduced into PDF to help with this text retrieval problem.
> A ToUnicode table can be associated with a font that does not normally
> have a way to determine the relationship between glyphs and Unicode
> characters (some do). The table maps strings of glyph identifiers into
> strings of Unicode characters, often just one to one, so that the
> proper character strings can be made from the glyph references in the
> file.
>
ToUnicode can only replace a byte with a sequence of bytes. A Type1 font
can encode only 256 characters, therefore such a mapping is possible.
Many years ago I developed a ToUnicode map for Velthuis Devanagari:
http://icebearsoft.euweb.cz/dvngpdf/
Complex scripts would require many-to-many mappings, but that is
impossible with ToUnicode.



-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] how to do (better) searchable PDFs in xelatex?

2012-10-15 Thread Mojca Miklavec
On Mon, Oct 15, 2012 at 10:32 AM, Joe Corneli wrote:
>
> but fi => ﬁ, despite the latter
> copy-pasting as "fi".  Somehow this does seem like it's just an
> oversight on the part of the font developers.

It can also be an oversight on the part of PDF viewer developers.
Apple decomposes all accented Latin characters for example ("C"
followed by "composing caron" instead of just "Č"). I always found
that horribly annoying. On the other hand it had zero problems with
infinity, other math symbols and Greek letters from pdfTeX-generated
documents, so I usually had no problems copy-pasting mathematical
formulas. I only had to add an extra pair of dollars to get a nicely
formatted formula.

You should try "pdftotext", Adobe Acrobat, Apple's Preview (if you
have access to it), some free viewers, ... and the results are often
different. In my opinion any decent PDF viewer should be able to
convert the "fi" ligature into two separate letters when copy-pasting.
This cannot be font designer's fault.
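(For what it's worth, the standard "fi" ligature U+FB01 carries a Unicode compatibility decomposition, so a viewer can map it back to the letter pair mechanically; PUA ligature slots like Junicode's f_t have no such decomposition, which is exactly the problem here. A one-line check in Python:)

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI has a compatibility decomposition,
# so NFKC normalization recovers the two letters.
print(unicodedata.normalize("NFKC", "\ufb01"))  # -> fi
print(unicodedata.decomposition("\ufb01"))      # -> <compat> 0066 0069
```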

Mojca

PS: It is 2012 ... and as of a couple of months ago Opera (the web
browser) still failed to display the most basic accented Latin
characters (like š & ž) even when the page encoding is properly set. Yes,
it's unbelievable.





Re: [XeTeX] how to do (better) searchable PDFs in xelatex?

2012-10-15 Thread Joe Corneli
On Mon, Oct 15, 2012 at 9:13 AM, Mojca Miklavec wrote:

> I might be wrong, but pdfTeX-generated documents work fine (after
> adding an encoding vector) even though the glyphs populate "random" slots
> in the font (for example T1 encoding) that have nothing to do with
> Unicode.
>
> It should be possible to do something similar in XeTeX/LuaTeX.
>
> I'm not saying that this would solve problems of copy-pasting Arabic
> scripts, but it should be possible to cover alternate glyphs for Latin
> scripts at least.

Sounds like a possible solution to my problem...  and indeed, this MWE
is searchable --

\documentclass{book}
\usepackage{libertine}
\begin{document}
Quantitative/Prefix.
\end{document}

... without the Qu =>  ligature, but fi => ﬁ, despite the latter
copy-pasting as "fi".  Somehow this does seem like it's just an
oversight on the part of the font developers.  Indeed, it seems this
is already noted in their bug tracker:

http://sourceforge.net/tracker/index.php?func=detail&aid=3575137&group_id=89513&atid=590374

I guess there's little else to do but wait until that's fixed.





Re: [XeTeX] how to do (better) searchable PDFs in xelatex?

2012-10-15 Thread Mojca Miklavec
On Mon, Oct 15, 2012 at 12:04 AM, Andrew Cunningham wrote:
>
> This is the nature of the PDF format. It is a print format that focuses on
> glyphs rather than characters.
>
> It partly depends on the font, and the OT features being used.
>
> In theory you can have ActualText in the PDF, but once you move to complex
> scripts all bets are off. Without a complete rewrite of the PDF standard
>  fidelity to the text is not really possible. PDF format wasn't designed
> to do it.

I might be wrong, but pdfTeX-generated documents work fine (after
adding an encoding vector) even though the glyphs populate "random" slots
in the font (for example T1 encoding) that have nothing to do with
Unicode.

It should be possible to do something similar in XeTeX/LuaTeX.

I'm not saying that this would solve problems of copy-pasting Arabic
scripts, but it should be possible to cover alternate glyphs for Latin
scripts at least.

Mojca

PS: From http://blogs.adobe.com/insidepdf/2008/07/text_content_in_pdf_files.html

There is an optional auxiliary structure called the "ToUnicode" table
that was introduced into PDF to help with this text retrieval problem.
A ToUnicode table can be associated with a font that does not normally
have a way to determine the relationship between glyphs and Unicode
characters (some do). The table maps strings of glyph identifiers into
strings of Unicode characters, often just one to one, so that the
proper character strings can be made from the glyph references in the
file.

