Re: [XeTeX] turn off special characters in PDF

2014-01-01 Thread Ross Moore
Hi Zdenek, and others,

On 01/01/2014, at 11:53, Zdenek Wagner zdenek.wag...@gmail.com wrote:

 The attached file (produced using pdfTeX, not XeTeX) is an example
 that I've used in TUG talks, and elsewhere.
 Try copy/paste of portions of the mathematics. Be aware that you can
 get different results depending upon the PDF viewer used when
 extracting the text.  (The file has uncompressed streams, so you
 can view it in a decent text editor to see the tagging structures
 used within the PDF content.)
 
 If I remember correctly, /ActualText supports only bytes, not
 code points. Thus accented characters cannot be encoded, nor can Indic
 characters.

I don't know what you mean by this.
In my testing I can tag pretty much any piece of content and map it to any
string using /ActualText.
Mostly I use Adobe's Acrobat Pro as the PDF reader, and this works fine with it,
apart from some bugs that have been reported when using very long replacement
strings.

In the example PDF that I attached to my previous message, each mathematical 
character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1 
alphanumerics expressed using surrogate pairs. 
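
For readers who want to see what such a string looks like, here is a minimal Python sketch (not the Perl script used for the attached PDF). The function name `actualtext_hex` is made up for illustration. It assumes the usual PDF convention that a UTF-16 text string starts with the byte-order mark FEFF.

```python
def actualtext_hex(text: str) -> str:
    """Return text as a PDF hex string: BOM FEFF, then big-endian UTF-16 code units."""
    payload = text.encode("utf-16-be")  # Plane-1 characters become surrogate pairs
    return "<FEFF" + payload.hex().upper() + ">"


print(actualtext_hex("x"))              # <FEFF0078>
print(actualtext_hex("\U0001D465"))     # MATHEMATICAL ITALIC SMALL X -> <FEFFD835DC65>
print(actualtext_hex("\u0915\u093f"))   # Devanagari KA + vowel sign I -> <FEFF0915093F>
```

The second call shows a Plane-1 character turning into the surrogate pair D835 DC65.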

I see no reason why Indic character strings could not be done similarly.
You probably need some on-the-fly preprocessing to work out the required 
strings to use.
This is certainly possible, and is what I do with mathematical expressions.
It should be possible to do it entirely within TeX, but the programming can get 
very tricky, so I use Perl instead.

 ToUnicode supports one byte to many bytes, not many bytes
 to many bytes.

Exactly. This is why /ActualText  is the structure to use.


 Indic scripts use reordering, where a matra precedes the
 consonants, and some scripts contain two-piece matras. Unless the
 specification has been corrected, the ToUnicode map is unable to handle
 Indic scripts properly.

Agreed;  /ToUnicode  is not what is needed here.
This sounds like precisely the kind of situation where you want to tag an 
extended block of content and use /ActualText  to map it to a pre-constructed 
Unicode string.
I'm no expert in Indic languages, so cannot provide specific details or 
examples.


 

Happy New Year,


Ross

--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] turn off special characters in PDF

2014-01-01 Thread Khaled Hosny
On Wed, Jan 01, 2014 at 10:07:54PM +1100, Ross Moore wrote:
  ToUnicode supports one byte to many bytes, not many bytes
  to many bytes.
 
 Exactly. This is why /ActualText  is the structure to use.

My only issue with /ActualText is that using it to tag whole words
breaks fine-grained text selection (one cannot select individual characters
inside these words, and searching for one character will highlight the
whole word containing it). Otherwise it is the most versatile mechanism
for preserving the original text in PDF files.

Because of that, I think a better strategy is to use the /ToUnicode mapping
whenever applicable and resort to /ActualText for the problematic
cases, namely one-to-many substitutions, reordering, and different
substitutions leading to the same glyph (though the last one can be
handled by duplicating the glyph under a different name/encoding when
subsetting the font).
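
The strategy in the preceding paragraph can be sketched roughly as follows. This is only an illustration in Python, not code from XeTeX or any driver. The cluster format, the glyph IDs and the name `choose_mechanism` are all hypothetical, and "more than one glyph in a cluster" stands in for both one-to-many substitution and reordering.

```python
from collections import defaultdict


def choose_mechanism(clusters):
    """clusters: list of (source_text, [glyph_id, ...]) pairs.

    Returns (tounicode, needs_actualtext):
      tounicode        -- dict glyph_id -> text for glyphs with one unambiguous mapping
      needs_actualtext -- indices of clusters that would have to be tagged instead
    """
    # Collect every source string that maps to each single glyph.
    texts_per_glyph = defaultdict(set)
    for text, glyphs in clusters:
        if len(glyphs) == 1:
            texts_per_glyph[glyphs[0]].add(text)

    tounicode = {}
    needs_actualtext = []
    for index, (text, glyphs) in enumerate(clusters):
        if len(glyphs) != 1:
            # One-to-many substitution or a reordered multi-glyph cluster.
            needs_actualtext.append(index)
        elif len(texts_per_glyph[glyphs[0]]) > 1:
            # Different source strings led to the same glyph.
            needs_actualtext.append(index)
        else:
            tounicode[glyphs[0]] = text
    return tounicode, needs_actualtext


clusters = [
    ("a", [10]),                   # simple one-to-one mapping
    ("fi", [20]),                  # ligature: many characters, one glyph
    ("\u0915\u093f", [31, 30]),    # Devanagari "ki": matra glyph drawn first
    ("A", [40]),                   # Latin A ...
    ("\u0391", [40]),              # ... and Greek Alpha sharing one glyph
]
print(choose_mechanism(clusters))
```

In this toy input the first two clusters can go into /ToUnicode, and the other three fall back to /ActualText. The last two could instead be handled by duplicating the glyph, as noted above.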

The situation in XeTeX is more complex because the typesetting (where
the original text string is known) is done in XeTeX, while the PDF
generation is done by the PDF driver, and the communication channel
between the two (XDV files) passes only glyph IDs, not the original text
strings. So we can only rely on font encodings and glyph names (or, in the
upcoming patch, try to guess glyph names by examining simple font
substitutions).

Regards,
Khaled




Re: [XeTeX] turn off special characters in PDF

2014-01-01 Thread Jonathan Kew

On 1/1/14 11:49, Khaled Hosny wrote:


The situation in XeTeX is more complex because the typesetting (where
the original text string is known) is done in XeTeX, while the PDF
generation is done by the PDF driver and the communication channel
between both (XDV files) passes only glyph ids not the original text
strings


I'd suggest that the best way forward here would be to modify xetex such 
that it includes the original Unicode text in the xdv stream, as well as 
the positioned glyphs. Then the driver can write a correct ActualText 
for each word.


There'd be some performance cost to this, of course; the inclusion of 
the Unicode text could be an optional feature, so that people who just 
want a throwaway pdf in order to print a document don't have to suffer 
slower generation and/or larger files.


This wouldn't address all the problems with pdf text extraction; 
higher-level issues of text structure and flow would still be tricky in 
the case of documents with any complex layout. But at least the basic 
Unicode characters making up each word would be reliably correct.


JK





Re: [XeTeX] turn off special characters in PDF

2014-01-01 Thread Zdenek Wagner
2014/1/1 Ross Moore ross.mo...@mq.edu.au:

 In the example PDF that I attached to my previous message, each mathematical
 character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1
 alphanumerics expressed using surrogate pairs.

Thank you, now I see it. The book where I read about /ActualText did
not mention that I can use UTF-16 if I start the string with a BOM. Can I
see the source of the PDF? It would help me a lot to see how you do all
these things.

-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz





Re: [XeTeX] turn off special characters in PDF

2014-01-01 Thread Ross Moore
Hi Zdeněk,

On 02/01/2014, at 2:14 AM, Zdenek Wagner wrote:

 2014/1/1 Ross Moore ross.mo...@mq.edu.au:

 In the example PDF that I attached to my previous message, each mathematical
 character is mapped to a big-endian UTF-16 hexadecimal string, with Plane-1
 alphanumerics expressed using surrogate pairs.
 
 Thank you, now I see it. The book where I read about /ActualText did
 not mention that I can use UTF16 if I start the string with BOM.

Fair enough; this I had to discover for myself.
The PDF Reference Manual (e.g. for ISO 32000) has no such examples,
so I had to experiment with different ways to specify strings requiring
non-ASCII characters. UTF-16 is the most elegant, and avoids the messiness
of using escape characters and octal codes, which are needed even for some
non-letter ASCII characters.
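
To make the comparison concrete, here is a small Python sketch that writes the same UTF-16 bytes both ways. The helper `pdf_literal_string` is a made-up name. It follows the general PDF rule that parentheses and backslashes inside a literal string are backslash-escaped, and writes every non-printable byte as a three-digit octal code.

```python
def pdf_literal_string(data: bytes) -> str:
    """Write bytes as a PDF literal string, escaping where needed."""
    out = []
    for byte in data:
        ch = chr(byte)
        if ch in "()\\":
            out.append("\\" + ch)            # delimiters and backslash need escaping
        elif 0x20 <= byte <= 0x7E:
            out.append(ch)                   # printable ASCII passes through
        else:
            out.append("\\%03o" % byte)      # everything else as a 3-digit octal code
    return "(" + "".join(out) + ")"


data = b"\xfe\xff" + "(x)".encode("utf-16-be")   # BOM + UTF-16BE for "(x)"
print(pdf_literal_string(data))                  # (\376\377\000\(\000x\000\))
print("<" + data.hex().upper() + ">")            # <FEFF002800780029>
```

Even for plain ASCII content such as "(x)", the literal form needs an escape for almost every byte, while the hexadecimal form stays uniform.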

 Can I
 see the source of the PDF? It could help me much to see how you do all
 these things.

Each piece of mathematics is captured, saved to a file, converted to MathML,
then run through my Perl script to create alternative (La)TeX source.
This is done to be able to create a fully-tagged PDF description of the
mathematical content, using a special version of pdftex that Han The Thanh
created for me (and others); it is still at an experimental stage.

You should not need all of this machinery, but I'm happy to answer
any questions you may have.

I've attached a couple of examples of the output from my Perl script,
in which you can see how the /ActualText replacement strings
are specified, using a macro \SMC, which ultimately expands to use
the \pdfstartmarkedcontent primitive.



[Attachments: 2013-Assign2-soln-inline-2-tags.tex,
 2013-Assign2-soln-inline-1-tags.tex]


Without the special primitives, you should be able to use  \pdfliteral 
to insert the tagging needed for just using  /ActualText .
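
A rough sketch of what that tagging could look like, written as a small Python generator that emits pdfTeX source. This is not the \SMC macro from the attachments. The /Span tag, the `direct` keyword and the function name `tag_with_actualtext` are my assumptions. The marked-content operators BDC and EMC are the standard PDF ones.

```python
def tag_with_actualtext(visible_tex: str, replacement: str) -> str:
    """Wrap visible TeX material in a marked-content span carrying /ActualText."""
    # Replacement text as a BOM-prefixed big-endian UTF-16 hex string.
    hex_string = "<FEFF" + replacement.encode("utf-16-be").hex().upper() + ">"
    begin = r"\pdfliteral direct{/Span << /ActualText " + hex_string + " >> BDC}"
    end = r"\pdfliteral direct{EMC}"
    return begin + visible_tex + end


# Typeset $x$ but let copy/paste deliver U+1D465 (mathematical italic small x).
print(tag_with_actualtext("$x$", "\U0001D465"))
```

The printed line can be pasted into a pdfTeX document. Whether the literals disturb spacing or kerning around the tagged material would need checking case by case.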

 
 I see no reason why Indic character strings could not be done similarly.
 You probably need some on-the-fly preprocessing to work out the required
 strings to use.


I'm not sure whether there is a LaTeX package that allows you to get the
literal bits into the correct place without upsetting other fine
details of the typesetting with Indic characters.
This certainly should be possible, at least when using pdfLaTeX.
I'm not sure of the details using XeTeX, but you work with the source code,
so you can devise anything that is needed, right?

 



Hope this helps,

Ross


Ross Moore   ross.mo...@mq.edu.au 
Mathematics Department   office: E7A-206  
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia  2109  fax: +61 (0)2 9850 8114




Re: [XeTeX] turn off special characters in PDF

2014-01-01 Thread Zdenek Wagner
2014/1/1 Ross Moore ross.mo...@mq.edu.au:

 I've attached a couple of examples of the output from my Perl script,
 in which you can see how the /ActualText  replacement strings
 are specified, using a macro \SMC -- which ultimately expands to use
 the  \pdfstartmarkedcontent  primitive.


Thank you.


 Without the special primitives, you should be able to use  \pdfliteral
 to insert the tagging needed for just using  /ActualText .


 I see no reason why Indic character strings could not be done similarly.
 You probably need some on-the-fly preprocessing to work out the required
 strings to use.


 I'm not sure whether there is a LaTeX package that allows you to get the
 literal bits into the correct place without upsetting other fine
 details of the typesetting with Indic characters.
 This certainly should be possible, at least when using  pdfLaTeX .
 Not sure of the details using XeTeX -- but you work with the source code,
 so can devise anything that is needed, right?

Typesetting depends on HarfBuzz and font features; no package is
needed (fontspec and polyglossia just save work that could be done with
primitives). Any code can be sent to xdvipdfmx by \special{pdf: code
...}, much as with \pdfliteral in pdftex. I already know how to do
it.
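
For other readers, a minimal sketch of the same idea for the XeTeX route, again as a Python generator of TeX source. The spelling `pdf:code` (without the space), the /Span tag and the function name `xdvipdfmx_tag` are assumptions on my part, not something confirmed in this thread. Only whole words are wrapped, since a \special inside a word would be expected to interfere with shaping.

```python
def xdvipdfmx_tag(word: str, replacement: str) -> str:
    """Wrap one whole word in \\special commands that carry /ActualText."""
    # Replacement text as a BOM-prefixed big-endian UTF-16 hex string.
    hex_string = "<FEFF" + replacement.encode("utf-16-be").hex().upper() + ">"
    begin = r"\special{pdf:code /Span << /ActualText " + hex_string + " >> BDC}"
    end = r"\special{pdf:code EMC}"
    return begin + word + end


# Devanagari "ki": the matra is drawn before the consonant, but the
# replacement string keeps the logical order KA (U+0915), I (U+093F).
ki = "\u0915\u093f"
print(xdvipdfmx_tag(ki, ki))
```

The visible word and the replacement are the same Unicode string here. The point is that the extracted text no longer depends on the order or identity of the glyphs.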





