Re: [XeTeX] turn off special characters in PDF
Hi Zdenek, and others,

On 01/01/2014, at 11:53, Zdenek Wagner <zdenek.wag...@gmail.com> wrote:

>> The attached file (produced using pdfTeX, not XeTeX) is an example that
>> I've used in TUG talks, and elsewhere. Try copy/paste of portions of the
>> mathematics. Be aware that you can get different results depending upon
>> the PDF viewer used when extracting the text. (The file has uncompressed
>> streams, so you can view it in a decent text editor to see the tagging
>> structures used within the PDF content.)
>
> If I remember it well, /ActualText supports only bytes, not codepoints.
> Thus accented characters cannot be encoded, nor can Indic characters.

I don't know what you mean by this. In my testing I can tag pretty much any
piece of content, and map it to any string using /ActualText. Mostly I use
Adobe's Acrobat Pro as the PDF reader, and this works fine with it, modulo
some bugs that have been reported when using very long replacement strings.

In the example PDF that I attached to my previous message, each mathematical
character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1
alphanumerics expressed using surrogate pairs.

I see no reason why Indic character strings could not be done similarly.
You probably need some on-the-fly preprocessing to work out the required
strings to use. This is certainly possible, and is what I do with
mathematical expressions. It should be possible to do it entirely within
TeX, but the programming can get very tricky, so I use Perl instead.

> ToUnicode supports one byte to many bytes, not many bytes to many bytes.

Exactly. This is why /ActualText is the structure to use.

> Indic scripts use reordering, where a matra precedes the consonants, and
> some scripts contain two-piece matras. Unless the specification has been
> corrected, the ToUnicode map is unable to handle the Indic scripts
> properly.

Agreed; /ToUnicode is not what is needed here. This sounds like precisely
the kind of situation where you want to tag an extended block of content
and use /ActualText to map it to a pre-constructed Unicode string. I'm no
expert in Indic languages, so cannot provide specific details or examples.

Hope this helps.

Happy New Year,

	Ross
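[Editorial illustration, not from the attached file: a minimal sketch of what
such tagging looks like in a PDF content stream. It wraps the glyph for the
Plane-1 character U+1D465 (MATHEMATICAL ITALIC SMALL X) in a marked-content
sequence whose /ActualText is a UTF-16BE string with a BOM and a surrogate
pair. The font resource name and text-positioning operators are placeholders.]

    % PDF content-stream fragment: tag one glyph with an /ActualText replacement.
    % <FEFF D835 DC65> = BOM + surrogate pair for U+1D465.
    /Span << /ActualText <FEFFD835DC65> >> BDC
      BT /F42 9.96 Tf 72 700 Td (x) Tj ET   % placeholder operators that draw the glyph
    EMC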
Re: [XeTeX] turn off special characters in PDF
On Wed, Jan 01, 2014 at 10:07:54PM +1100, Ross Moore wrote:

>> ToUnicode supports one byte to many bytes, not many bytes to many bytes.
>
> Exactly. This is why /ActualText is the structure to use.

My only issue with /ActualText is that using it to tag whole words breaks
fine-grained text selection: one cannot select individual characters inside
such words, and searching for one character will highlight the whole word
containing it. Otherwise it is the most versatile mechanism for preserving
the original text in PDF files.

Because of that, I think a better strategy is to use a /ToUnicode mapping
whenever applicable and resort to /ActualText only for the problematic
cases, namely one-to-many substitutions, reordering, and different
substitutions leading to the same glyph (though the last one can be handled
by duplicating the glyph under a different name/encoding when subsetting
the font).

The situation in XeTeX is more complex because the typesetting (where the
original text string is known) is done in XeTeX, while the PDF generation
is done by the PDF driver, and the communication channel between the two
(XDV files) passes only glyph ids, not the original text strings. So we can
only rely on font encodings and glyph names (or try to guess glyph names by
examining simple font substitutions, as in the upcoming patch).

Regards,
Khaled
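[Editorial illustration of the one-to-many case that a /ToUnicode CMap can
handle: a single ffi-ligature glyph mapped to the three codepoints f, f, i.
The glyph id 0123 is made up for the example.]

    % fragment of a /ToUnicode CMap stream; glyph id <0123> is hypothetical
    1 beginbfchar
    <0123> <006600660069>   % one glyph (ffi ligature) -> U+0066 U+0066 U+0069
    endbfchar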
Re: [XeTeX] turn off special characters in PDF
On 1/1/14 11:49, Khaled Hosny wrote:

> The situation in XeTeX is more complex because the typesetting (where the
> original text string is known) is done in XeTeX, while the PDF generation
> is done by the PDF driver and the communication channel between both
> (XDV files) passes only glyph ids not the original text strings

I'd suggest that the best way forward here would be to modify xetex such
that it includes the original Unicode text in the xdv stream, as well as
the positioned glyphs. Then the driver can write a correct ActualText for
each word.

There'd be some performance cost to this, of course; the inclusion of the
Unicode text could be an optional feature, so that people who just want a
throwaway pdf in order to print a document don't have to suffer slower
generation and/or larger files.

This wouldn't address all the problems with pdf text extraction;
higher-level issues of text structure and flow would still be tricky in the
case of documents with any complex layout. But at least the basic Unicode
characters making up each word would be reliably correct.

JK
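[Editorial sketch of the per-word idea, using a made-up example: for the
Devanagari syllable "ki" the i-matra glyph is drawn before the consonant,
but the /ActualText records the logical order U+0915 U+093F. The font name
and glyph ids are placeholders; only the /ActualText string is meaningful.]

    % hypothetical driver output for one word: glyphs in visual order,
    % /ActualText in logical (Unicode) order; <FEFF 0915 093F> = BOM + KA + vowel sign I
    /Span << /ActualText <FEFF0915093F> >> BDC
      BT /F7 10 Tf 72 650 Td <021301A4> Tj ET   % placeholder glyph ids: matra, then consonant
    EMC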
Re: [XeTeX] turn off special characters in PDF
2014/1/1 Ross Moore <ross.mo...@mq.edu.au>:

> Hi Zdenek, and others,
>
> On 01/01/2014, at 11:53, Zdenek Wagner <zdenek.wag...@gmail.com> wrote:
>
>>> The attached file (produced using pdfTeX, not XeTeX) is an example that
>>> I've used in TUG talks, and elsewhere. Try copy/paste of portions of
>>> the mathematics. Be aware that you can get different results depending
>>> upon the PDF viewer used when extracting the text. (The file has
>>> uncompressed streams, so you can view it in a decent text editor to see
>>> the tagging structures used within the PDF content.)
>>
>> If I remember it well, /ActualText supports only bytes, not codepoints.
>> Thus accented characters cannot be encoded, nor can Indic characters.
>
> I don't know what you mean by this. In my testing I can tag pretty much
> any piece of content, and map it to any string using /ActualText. Mostly
> I use Adobe's Acrobat Pro as the PDF reader, and this works fine with it,
> modulo some bugs that have been reported when using very long replacement
> strings.
>
> In the example PDF that I attached to my previous message, each
> mathematical character is mapped to a big-endian UTF-16 hexadecimal
> string, with Plane-1 alphanumerics expressed using surrogate pairs.

Thank you, now I see it. The book where I read about /ActualText did not
mention that I can use UTF-16 if I start the string with a BOM. Can I see
the source of the PDF? It would help me a lot to see how you do all these
things.

> [...]

--
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
Re: [XeTeX] turn off special characters in PDF
Hi Zdeněk,

On 02/01/2014, at 2:14 AM, Zdenek Wagner wrote:

> 2014/1/1 Ross Moore <ross.mo...@mq.edu.au>:
>
>> In the example PDF that I attached to my previous message, each
>> mathematical character is mapped to a big-endian UTF-16 hexadecimal
>> string, with Plane-1 alphanumerics expressed using surrogate pairs.
>
> Thank you, now I see it. The book where I read about /ActualText did not
> mention that I can use UTF-16 if I start the string with a BOM.

Fair enough; this I had to discover for myself. The PDF Reference Manual
(e.g. for ISO 32000) has no such examples, so I had to experiment with
different ways to specify strings requiring non-ASCII characters. UTF-16 is
the most elegant, and avoids the messiness of using escape characters and
octal codes, even for some non-letter ASCII characters.

> Can I see the source of the PDF? It would help me a lot to see how you do
> all these things.

Each piece of mathematics is captured, saved to a file, converted to
MathML, then run through my Perl script to create alternative (La)TeX
source. This is done to be able to create a fully-tagged PDF description of
the mathematical content, using a special version of pdftex that Han The
Thanh created for me (and others) -- still at an experimental stage. You
should not need all of this machinery, but I'm happy to answer any
questions you may have.

I've attached a couple of examples of the output from my Perl script, in
which you can see how the /ActualText replacement strings are specified,
using a macro \SMC -- which ultimately expands to use the
\pdfstartmarkedcontent primitive.

[Attachments: 2013-Assign2-soln-inline-2-tags.tex,
 2013-Assign2-soln-inline-1-tags.tex]

Without the special primitives, you should be able to use \pdfliteral to
insert the tagging needed for just using /ActualText.

>> I see no reason why Indic character strings could not be done similarly.
>> You probably need some on-the-fly preprocessing to work out the required
>> strings to use.

I'm not sure whether there is a LaTeX package that allows you to get the
literal bits into the correct place without upsetting other fine details of
the typesetting with Indic characters. This certainly should be possible,
at least when using pdfLaTeX. Not sure of the details using XeTeX -- but
you work with the source code, so you can devise anything that is needed,
right?

Hope this helps,

	Ross

------------------------------------------------------------
Ross Moore                          ross.mo...@mq.edu.au
Mathematics Department              office: E7A-206
Macquarie University                tel: +61 (0)2 9850 8955
Sydney, Australia  2109             fax: +61 (0)2 9850 8114
------------------------------------------------------------
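[Editorial sketch of that last remark, not the actual \SMC macro from the
attachments: a plain pdfTeX wrapper whose name, \ActualTextSpan, is made up.
Its first argument is the UTF-16BE replacement string as hex digits; the
macro prepends the BOM.]

    % Hypothetical helper: wrap #2 in a /Span marked-content sequence whose
    % /ActualText is the UTF-16BE string given as hex digits in #1 (BOM added here).
    \def\ActualTextSpan#1#2{%
      \pdfliteral direct{/Span << /ActualText <FEFF#1> >> BDC}%
      #2%
      \pdfliteral direct{EMC}}

    % Example: have the ligature "ffi" extract as the three letters f, f, i.
    \ActualTextSpan{006600660069}{ffi}

Whether `direct' is the right keyword, or whether the literals need to be
issued at a different point relative to the glyphs, would need testing.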
Re: [XeTeX] turn off special characters in PDF
2014/1/1 Ross Moore <ross.mo...@mq.edu.au>:

> Hi Zdeněk,
>
> On 02/01/2014, at 2:14 AM, Zdenek Wagner wrote:
>
>> 2014/1/1 Ross Moore <ross.mo...@mq.edu.au>:
>>
>>> In the example PDF that I attached to my previous message, each
>>> mathematical character is mapped to a big-endian UTF-16 hexadecimal
>>> string, with Plane-1 alphanumerics expressed using surrogate pairs.
>>
>> Thank you, now I see it. The book where I read about /ActualText did not
>> mention that I can use UTF-16 if I start the string with a BOM.
>
> Fair enough; this I had to discover for myself. The PDF Reference Manual
> (e.g. for ISO 32000) has no such examples, so I had to experiment with
> different ways to specify strings requiring non-ASCII characters. UTF-16
> is the most elegant, and avoids the messiness of using escape characters
> and octal codes, even for some non-letter ASCII characters.
>
>> Can I see the source of the PDF? It would help me a lot to see how you
>> do all these things.
>
> Each piece of mathematics is captured, saved to a file, converted to
> MathML, then run through my Perl script to create alternative (La)TeX
> source. This is done to be able to create a fully-tagged PDF description
> of the mathematical content, using a special version of pdftex that Han
> The Thanh created for me (and others) -- still at an experimental stage.
> You should not need all of this machinery, but I'm happy to answer any
> questions you may have.
>
> I've attached a couple of examples of the output from my Perl script, in
> which you can see how the /ActualText replacement strings are specified,
> using a macro \SMC -- which ultimately expands to use the
> \pdfstartmarkedcontent primitive.

Thank you.

> Without the special primitives, you should be able to use \pdfliteral to
> insert the tagging needed for just using /ActualText.
>
>>> I see no reason why Indic character strings could not be done
>>> similarly. You probably need some on-the-fly preprocessing to work out
>>> the required strings to use.
>
> I'm not sure whether there is a LaTeX package that allows you to get the
> literal bits into the correct place without upsetting other fine details
> of the typesetting with Indic characters. This certainly should be
> possible, at least when using pdfLaTeX. Not sure of the details using
> XeTeX -- but you work with the source code, so you can devise anything
> that is needed, right?

Typesetting depends on HarfBuzz and font features; no package is needed
(fontspec and polyglossia just save work that could be done by primitives).
Any code can be sent to xdvipdfmx by \special{pdf: code ...}, similarly to
\pdfliteral in pdftex. I already know how to do it.

--
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz
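[Editorial sketch of the XeTeX/xdvipdfmx analogue of the earlier pdfTeX
wrapper, assuming the pdf: code special mentioned above passes its argument
straight into the page content stream; the macro name is again made up.]

    % Hypothetical XeTeX version, sending raw PDF operators via xdvipdfmx specials.
    \def\ActualTextSpan#1#2{%
      \special{pdf: code /Span << /ActualText <FEFF#1> >> BDC}%
      #2%
      \special{pdf: code EMC}}

    % Example: record the logical order U+0915 U+093F for a reordered
    % Devanagari syllable, whatever glyph reordering the shaper performs.
    \ActualTextSpan{0915093F}{कि}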