Re: [XeTeX] potential new feature: \XeTeXgenerateactualtext

2016-02-24 Thread Ross Moore
Hi Will,

> On Feb 25, 2016, at 5:19 PM, Will Robertson wrote:
> 
> Hi Ross,
> 
> Great to hear from you.
> I thought of you straight away when writing my email :)
> 
> 
>> On 25 Feb 2016, at 11:35 AM, Ross Moore wrote:
>> 
>> You have to be *very* careful with /ActualText, since it must be done using 
>> PDFdoc encoding, 
>> as it becomes part of the page contents stream.
>> Any errors will corrupt the PDF file completely — but that’s true of other 
>> things as well.
>> Heiko’s  \pdfstringdef  in the hyperref package is very good for handling 
>> this…
> 
> That’s good to know, thanks.
> I think there has been *some* work by one or two of the LaTeX3 members on 
> general methods for this sort of thing, but it’s been a while.

Send me their names.
I may have a bit more time this year.


>> Look at some of my papers associated with TUG conferences, to see various
>> options that can be used to make mathematics more accessible in PDFs; i.e.,
>> papers numbered as 5, 6, 7 on this page: 
>> 
>>  http://www.tug.org/twg/accessibility/
>> 
>> Although these were done using pdfTeX, some of these things should be able
>> to be implemented for XeTeX + xdvipdfmx  also.
> 
> This is exactly where I was going with all this (so we’re getting quite far 
> away from the new primitive).
> My understanding is that the extended pdfTeX you were using was included in 
> TeX Live 2015, is that right? Or will be in TL2016?

The later papers, which are not directly on “Tagged PDF”, don’t require
the special tagging features.

> How much work would it be to translate that work into something that will 
> also function in XeTeX?

That depends on how easy it is to create PDF objects and object references
between them.
I don’t know how xdvipdfmx does it. If it uses pdfmark, as dvips does,
then it’s nowhere near as convenient as with pdfTeX.

Hopefully someone with the necessary experience can pick up on those ideas.
That’s why I’ve followed up your comment on this list.
Indeed, we need someone to get  pdfx.sty  working with XeLaTeX;
it’s for similar reasons that it doesn’t do so already.

Switch it to another thread, if you think that is appropriate.

> Cheers,
> Will

Cheers,

Ross






--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] potential new feature: \XeTeXgenerateactualtext

2016-02-24 Thread Akira Kakuto
The code for the \XeTeXgenerateactualtext feature (it's an integer 
parameter; set it to 1 to get ActualText added to the PDF, for better 
copy/paste and search in Acrobat) is now on sourceforge, in an 
"actualtext" branch, for anyone who wants to try building and 
experimenting with it.


A 32-bit Windows binary for testing, based on Jonathan's 845506,
is available at:
http://members2.jcom.home.ne.jp/wt1357ak/xetex-ac-txt.zip

I'll remove the file in due time.

Best,
Akira





Re: [XeTeX] potential new feature: \XeTeXgenerateactualtext

2016-02-24 Thread Will Robertson
Hi Ross,

Great to hear from you.
I thought of you straight away when writing my email :)


> On 25 Feb 2016, at 11:35 AM, Ross Moore wrote:
> 
> You have to be *very* careful with /ActualText, since it must be done using 
> PDFdoc encoding, 
> as it becomes part of the page contents stream.
> Any errors will corrupt the PDF file completely — but that’s true of other 
> things as well.
> Heiko’s  \pdfstringdef  in the hyperref package is very good for handling 
> this…

That’s good to know, thanks.
I think there has been *some* work by one or two of the LaTeX3 members on 
general methods for this sort of thing, but it’s been a while.


>> This sounds interesting for maths, where there is a chance we could 
>> automatically insert \special{}s at the glyph and/or the equation level — 
>> has this always been possible in XeTeX or does this require the newest patch 
>> for xdvipdfmx you just released?
> 
> … but doing the math-characters correctly, without interfering with spacings, 
> is highly non-trivial.

I have no doubt!!


> Look at some of my papers associated with TUG conferences, to see various
> options that can be used to make mathematics more accessible in PDFs; i.e.,
> papers numbered as 5, 6, 7 on this page: 
> 
>   http://www.tug.org/twg/accessibility/
> 
> Although these were done using pdfTeX, some of these things should be able
> to be implemented for XeTeX + xdvipdfmx  also.

This is exactly where I was going with all this (so we’re getting quite far 
away from the new primitive).
My understanding is that the extended pdfTeX you were using was included in TeX 
Live 2015, is that right? Or will be in TL2016?

How much work would it be to translate that work into something that will also 
function in XeTeX?

Cheers,
Will







Re: [XeTeX] potential new feature: \XeTeXgenerateactualtext

2016-02-24 Thread ShreeDevi Kumar
Jonathan,

This is a really useful feature and I look forward to using it once it is
released in TL2016.

Since how well the search and copy-paste features work could also be font
dependent, I would like to test some more PDFs in Unicode Devanagari
created by this new feature using other fonts. I usually use the Siddhanta
and Sanskrit2003 fonts.

I would appreciate it if you or other members who have this feature installed
could provide a few more sample PDFs in Devanagari for testing.

Thanks!

- sent from my phone. excuse the brevity.
On 24-Feb-2016 3:37 pm, "Jonathan Kew" wrote:

> On 24/2/16 09:22, ShreeDevi Kumar wrote:
>
>> Testing dev-actualtext.pdf sent by JK
>>
>>   * Adobe Acrobat Reader XI on Windows 10
>>       o Does not highlight text fully
>>       o SEARCH finds words and word parts correctly but usually
>>         highlights only beginning of the word containing the letter
>>       o COPY paste to NOTEPAD++, OPENOFFICE WRITER works correctly,
>>       o Save as TXT file does not work correctly - only saves ... in it,
>>         not the actual unicode text which can be copied
>>
>
> So it looks like Acrobat makes use of the ActualText for Search and Copy,
> but sadly its "Save as Text" doesn't support Unicode.
>
> I'm pleasantly surprised to see the Gmail previewer also handles it.
>
> The others (Foxit, Edge) sound like they're just working from the glyph
> stream, which is basically doomed to failure.
>
> For a further data point, I tried Evince (Document Viewer) on Ubuntu
> 15.10, and found that Copy and Search work well; it looks like it is using
> the ActualText correctly. This is thanks to the poppler library, I believe.
> The (poppler-based) "pdftotext" tool was also able to extract the Unicode
> text correctly from the PDF, although "pdftohtml" didn't do so well.
>
> One issue with Evince is that drag-selecting text to highlight it (as for
> Copy/Paste) looks bad: the highlighting completely obscures the selected
> text, although it will end up being copied correctly. Interestingly, its
> highlighting of search results doesn't suffer from this problem, and it
> even makes a fair attempt (not completely accurate) at highlighting
> specific letters within a word, not just entire words.
>
> JK
>
>
>>   * Foxit Reader 7.3 on Windows 10
>>       o Highlights text fully,
>>       o smallest highlight unit is word,
>>       o COPY paste to notepad++ as well as SEARCH does NOT work
>>         correctly as Unicode text is not fully correct.
>>
>> ूय
>>
>> िनकोड क्या ह ? ै
>>
>>       o Save as TXT file does not work correctly - saves the unicode
>>         text with same problems as in copy and paste
>>
>>   * Microsoft Edge Viewer on Windows 10
>>       o Highlights text fully,
>>       o COPY paste to notepad++ as well as SEARCH does NOT work
>>         correctly as Unicode text is not fully correct.
>>
>> य ूिनकोड क्या है?
>>
>>   * Previewing from within gmail in Chrome on Windows 10 -
>>       o Highlights text fully,
>>       o smallest highlight unit is word,
>>       o COPY paste to NOTEPAD++, OPENOFFICE WRITER works correctly,
>>       o (highlights only first letter of first word in
>>         paragraph यू rather than full word यूनिकोड)
>>       o there is NO SEARCH feature
>>       o there is no save as TXT file feature
>>   * Same as above while Previewing from within gmail in Internet
>>     Explorer on Windows 10
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, Feb 23, 2016 at 11:30 PM, Jonathan Kew wrote:
>>
>> On 23/2/16 17:39, Philip Taylor wrote:
>>
>> Using Akira-san's "actest.pdf" as sample, Adobe Acrobat Pro 7.1
>> allows
>> me to select only half of the text whereas Adobe Reader DC
>> allows me to
>> select it all; neither allows me to select individual kanji.
>>
>>
>> Ah, right... as there are no spaces between the kanji, they'll end
>> up in the same text object. That's a shortcoming of how the current
>> implementation works, for scripts that don't use inter-word spaces.
>>
>> In either case, copy&paste actually gives you the whole text, even
>> though AAPro only highlights half of it, I guess?
>>
>> JK

Re: [XeTeX] potential new feature: \XeTeXgenerateactualtext

2016-02-24 Thread Ross Moore
Hi Will, Jonathan, and others

> On Feb 25, 2016, at 10:31 AM, Will Robertson wrote:
> 
> On 24 Feb 2016, at 2:20 AM, Jonathan Kew wrote:
>> 
>> For a document that wants some other kind of "ActualText", there's going to 
>> need to be pretty detailed markup in the source, I think. (E.g. each word, 
>> or similar unit, will need to be tagged to provide the desired ActualText 
>> that goes with it.) At that point, I wonder if turning off 
>> \XeTeXgenerateactualtext and just doing it "manually" with macros that 
>> generate \special{}s would be the most reasonable way forward.
> 

You have to be *very* careful with /ActualText, since it must be done using 
PDFdoc encoding, 
as it becomes part of the page contents stream.
Any errors will corrupt the PDF file completely — but that’s true of other 
things as well.
Heiko’s  \pdfstringdef  in the hyperref package is very good for handling 
this...
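
To make the encoding concern concrete, here is a minimal sketch in Python. It is not what \pdfstringdef or xdvipdfmx does internally. It relies on the PDF specification allowing a text string such as an /ActualText value to be written either in PDFDocEncoding or as UTF-16BE with a leading FE FF byte-order mark; writing it as a hex string avoids having to escape parentheses and backslashes. The function name `actualtext_hex` is hypothetical.

```python
import codecs


def actualtext_hex(text):
    """Return text as a PDF hex string: UTF-16BE preceded by the FE FF byte-order mark."""
    data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


# A plain ASCII letter and a Devanagari syllable (YA + VOWEL SIGN UU).
print(actualtext_hex("A"))             # <FEFF0041>
print(actualtext_hex("\u092F\u0942"))  # <FEFF092F0942>
```

A string produced this way contains only hexadecimal digits and angle brackets, so it cannot unbalance the delimiters of the surrounding content stream.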

> This sounds interesting for maths, where there is a chance we could 
> automatically insert \special{}s at the glyph and/or the equation level — has 
> this always been possible in XeTeX or does this require the newest patch for 
> xdvipdfmx you just released?

 … but doing the math-characters correctly, without interfering with spacings, 
is highly non-trivial.

Look at some of my papers associated with TUG conferences, to see various
options that can be used to make mathematics more accessible in PDFs; i.e.,
papers numbered as 5, 6, 7 on this page: 

   http://www.tug.org/twg/accessibility/

Although these were done using pdfTeX, some of these things should be able
to be implemented for XeTeX + xdvipdfmx  also.


> 
> Cheers,
> Will


Cheers,

Ross






Re: [XeTeX] potential new feature: \XeTeXgenerateactualtext

2016-02-24 Thread Will Robertson
On 24 Feb 2016, at 2:20 AM, Jonathan Kew wrote:
> 
> For a document that wants some other kind of "ActualText", there's going to 
> need to be pretty detailed markup in the source, I think. (E.g. each word, or 
> similar unit, will need to be tagged to provide the desired ActualText that 
> goes with it.) At that point, I wonder if turning off 
> \XeTeXgenerateactualtext and just doing it "manually" with macros that 
> generate \special{}s would be the most reasonable way forward.

This sounds interesting for maths, where there is a chance we could 
automatically insert \special{}s at the glyph and/or the equation level — has 
this always been possible in XeTeX or does this require the newest patch for 
xdvipdfmx you just released?
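
For reference, the PDF-level construct that such \special{}s would ultimately have to produce is a marked-content sequence: `/Span <</ActualText ...>> BDC`, then the glyph-drawing operators, then `EMC`. The Python sketch below only builds that text. How it would be passed to xdvipdfmx through \special{} is not shown and would need checking against the driver documentation. The function names and the placeholder comment line are hypothetical.

```python
def hex_text_string(text):
    """Encode text as a PDF hex string in UTF-16BE with a leading byte-order mark."""
    return "<FEFF" + text.encode("utf-16-be").hex().upper() + ">"


def wrap_actualtext(actual, glyph_ops):
    """Surround content-stream operators with a /Span sequence carrying /ActualText."""
    return f"/Span <</ActualText {hex_text_string(actual)}>> BDC\n{glyph_ops}\nEMC"


# The second argument stands in for whatever operators draw the glyphs;
# a line starting with % is a comment in a PDF content stream.
print(wrap_actualtext("x^2", "% glyph-drawing operators for the formula go here"))
```

Doing this at the equation level needs one such wrapper per formula; doing it per glyph needs one per character, which is where the spacing concerns come in.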

Cheers,
Will







Re: [XeTeX] potential new feature: \XeTeXgenerateactualtext

2016-02-24 Thread Jonathan Kew

On 24/2/16 09:22, ShreeDevi Kumar wrote:

Testing dev-actualtext.pdf sent by JK

  * Adobe Acrobat Reader XI on Windows 10
      o Does not highlight text fully
      o SEARCH finds words and word parts correctly but usually
        highlights only beginning of the word containing the letter
      o COPY paste to NOTEPAD++, OPENOFFICE WRITER works correctly,
      o Save as TXT file does not work correctly - only saves ... in it,
        not the actual unicode text which can be copied


So it looks like Acrobat makes use of the ActualText for Search and 
Copy, but sadly its "Save as Text" doesn't support Unicode.


I'm pleasantly surprised to see the Gmail previewer also handles it.

The others (Foxit, Edge) sound like they're just working from the glyph 
stream, which is basically doomed to failure.
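
One reason the glyph stream is unreliable for Devanagari: the vowel sign I (U+093F) is stored after its consonant in Unicode but drawn to its left, so a viewer that reads glyphs in drawing order can emit it first. The small Python sketch below compares the two orders for the syllable नि, which matches the misplaced vowel sign in the Foxit results reported in this thread. The visual-order string is constructed by hand for illustration, not captured from any viewer.

```python
import unicodedata

logical = "\u0928\u093F"  # as stored in Unicode: NA, then VOWEL SIGN I
visual = logical[::-1]    # the same two code points in left-to-right drawing order


def names(text):
    """List the Unicode character names of each code point in text."""
    return [unicodedata.name(ch) for ch in text]


print(names(logical))  # ['DEVANAGARI LETTER NA', 'DEVANAGARI VOWEL SIGN I']
print(names(visual))   # ['DEVANAGARI VOWEL SIGN I', 'DEVANAGARI LETTER NA']
print(logical == visual)  # False
```

A search for the logically ordered string will not match text extracted in drawing order, which is consistent with the failed searches reported for those viewers.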


For a further data point, I tried Evince (Document Viewer) on Ubuntu 
15.10, and found that Copy and Search work well; it looks like it is 
using the ActualText correctly. This is thanks to the poppler library, I 
believe. The (poppler-based) "pdftotext" tool was also able to extract 
the Unicode text correctly from the PDF, although "pdftohtml" didn't do 
so well.


One issue with Evince is that drag-selecting text to highlight it (as 
for Copy/Paste) looks bad: the highlighting completely obscures the 
selected text, although it will end up being copied correctly. 
Interestingly, its highlighting of search results doesn't suffer from 
this problem, and it even makes a fair attempt (not completely accurate) 
at highlighting specific letters within a word, not just entire words.


JK



  * Foxit Reader 7.3 on Windows 10
      o Highlights text fully,
      o smallest highlight unit is word,
      o COPY paste to notepad++ as well as SEARCH does NOT work
        correctly as Unicode text is not fully correct.

ूय

िनकोड क्या ह ? ै

      o Save as TXT file does not work correctly - saves the unicode
        text with same problems as in copy and paste

  * Microsoft Edge Viewer on Windows 10
      o Highlights text fully,
      o COPY paste to notepad++ as well as SEARCH does NOT work
        correctly as Unicode text is not fully correct.

य ूिनकोड क्या है?

  * Previewing from within gmail in Chrome on Windows 10 -
      o Highlights text fully,
      o smallest highlight unit is word,
      o COPY paste to NOTEPAD++, OPENOFFICE WRITER works correctly,
      o (highlights only first letter of first word in
        paragraph यू rather than full word यूनिकोड)
      o there is NO SEARCH feature
      o there is no save as TXT file feature
  * Same as above while Previewing from within gmail in Internet
    Explorer on Windows 10


ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Feb 23, 2016 at 11:30 PM, Jonathan Kew wrote:

On 23/2/16 17:39, Philip Taylor wrote:

Using Akira-san's "actest.pdf" as sample, Adobe Acrobat Pro 7.1
allows
me to select only half of the text whereas Adobe Reader DC
allows me to
select it all; neither allows me to select individual kanji.


Ah, right... as there are no spaces between the kanji, they'll end
up in the same text object. That's a shortcoming of how the current
implementation works, for scripts that don't use inter-word spaces.
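
A toy model of the behaviour described here, in Python. It assumes, as this message suggests, that units are delimited by inter-word spaces; the XeTeX source was not consulted. The function `actualtext_units` is hypothetical and only illustrates why text without spaces ends up as one large unit.

```python
def actualtext_units(line):
    """Split a line into units at whitespace, mimicking one tagged unit per word."""
    return line.split()


print(actualtext_units("copy and paste"))  # three separate units
print(actualtext_units("日本語組版"))        # a single unit: no inter-word spaces
```

Under this model a viewer that selects by unit has nothing smaller than the whole run of kanji to offer.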

In either case, copy&paste actually gives you the whole text, even
though AAPro only highlights half of it, I guess?

JK






Re: [XeTeX] potential new feature: \XeTeXgenerateactualtext

2016-02-24 Thread ShreeDevi Kumar
Testing dev-actualtext.pdf sent by JK


   - Adobe Acrobat Reader XI on Windows 10
      - Does not highlight text fully
      - SEARCH finds words and word parts correctly but usually highlights
        only beginning of the word containing the letter
      - COPY paste to NOTEPAD++, OPENOFFICE WRITER works correctly,
      - Save as TXT file does not work correctly - only saves ... in it,
        not the actual unicode text which can be copied
   - Foxit Reader 7.3 on Windows 10
      - Highlights text fully,
      - smallest highlight unit is word,
      - COPY paste to notepad++ as well as SEARCH does NOT work correctly as
        Unicode text is not fully correct.

ूय

िनकोड क्या ह ? ै

      - Save as TXT file does not work correctly - saves the unicode text
        with same problems as in copy and paste
   - Microsoft Edge Viewer on Windows 10
      - Highlights text fully,
      - COPY paste to notepad++ as well as SEARCH does NOT work correctly
        as Unicode text is not fully correct.

य ूिनकोड क्या है?

   - Previewing from within gmail in Chrome on Windows 10 -
      - Highlights text fully,
      - smallest highlight unit is word,
      - COPY paste to NOTEPAD++, OPENOFFICE WRITER works correctly,
      - (highlights only first letter of first word in paragraph यू rather
        than full word यूनिकोड)
      - there is NO SEARCH feature
      - there is no save as TXT file feature
   - Same as above while Previewing from within gmail in Internet Explorer
     on Windows 10




ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Feb 23, 2016 at 11:30 PM, Jonathan Kew wrote:

> On 23/2/16 17:39, Philip Taylor wrote:
>
>> Using Akira-san's "actest.pdf" as sample, Adobe Acrobat Pro 7.1 allows
>> me to select only half of the text whereas Adobe Reader DC allows me to
>> select it all; neither allows me to select individual kanji.
>>
>>
> Ah, right... as there are no spaces between the kanji, they'll end up in
> the same text object. That's a shortcoming of how the current
> implementation works, for scripts that don't use inter-word spaces.
>
> In either case, copy&paste actually gives you the whole text, even though
> AAPro only highlights half of it, I guess?
>
> JK

