[Bug 1778988] [NEW] After PDF Files created by cups-pdf, cannot extract text from them

Bob Swanson Wed, 27 Jun 2018 12:51:43 -0700

Public bug reported:

PDF Creation Problem


Bob Swanson
[email protected]
26 June 2018

This file is part of the test package:

http://swansongrp.com/misc/testcase.zip


I have been able to reproduce PDF
printing problems with LibreOffice and with
web browsers. (For contrast, I have also
included output from the "wkhtmltopdf"
command-line utility.)


USING LIBREOFFICE
-----------------

(Base file: mytest.odt)

This problem was originally associated
with a LibreOffice file containing mixed
font usages. When printed with "cups-pdf",
most of the displayed text could not be
selected in "evince", and the extracted text
was garbage. (The PDFBox Java code could not
extract reasonable text from the PDF file.) See:

https://issues.apache.org/jira/browse/PDFBOX-4250

(working environment described in that
bug report)

It is much easier to demonstrate this
problem without using PDFBox Java.

To demonstrate, view any of the resulting PDF
files with "evince". When viewing a PDF
test file, simply press CTRL-A to select all 
text. Then "paste" the selected area into a text 
editor (Gedit or VIM, for instance) to see the 
resulting plain text. (Okular did no better
than evince.)
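
The manual check above can also be scripted. What follows is
only a minimal sketch and is not part of the test package. It
assumes the "pdftotext" tool from poppler-utils is installed
(evince and Okular both render PDFs through poppler, so its
output should be comparable to what they copy). The function
names and the "expected_phrases" list are hypothetical and
are introduced here purely for illustration.

```python
import subprocess


def extract_text(pdf_path):
    """Return the text that poppler's pdftotext extracts from pdf_path.

    Assumes pdftotext is on PATH; "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # garbage output may not decode cleanly
        check=True,
    )
    return result.stdout


def fraction_found(text, expected_phrases):
    """Return the fraction of expected phrases that occur in text.

    Whitespace runs are collapsed and case is ignored, so extra tabs
    or line breaks in the extracted text do not count as failures.
    """
    if not expected_phrases:
        return 0.0
    normalized = " ".join(text.split()).lower()
    hits = sum(
        1
        for phrase in expected_phrases
        if " ".join(phrase.split()).lower() in normalized
    )
    return hits / len(expected_phrases)
```

To use it, pass fraction_found() the result of extract_text()
for one of the test PDFs together with a few phrases copied
from the source document. A value near 1.0 means the text
survived extraction; a value near 0.0 corresponds to the
"garbage" or missing text described in this report.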

The following results occur:

o) With original testcase: mytest3_cups_pdf.pdf,
created from LibreOffice using cups-pdf

Only one line highlights, but its text is correct.
This particular line uses a "standard" PDF font. The
other lines are not highlighted and are not placed on
the clipboard. This was the test that failed in PDFBox
Java text extraction.

Evince shows the many embedded fonts.


o) With original testcase: mytest_libreoffice_direct.pdf
created from LibreOffice using its "built in" PDF
creation option

ALL lines highlight, and when pasted, all text is
present.

Evince shows the many embedded fonts.


USING CHROMIUM BROWSER
----------------------

(Base file: mytest.html)

I created an HTML file containing several lines
in different fonts, viewed it in the Chromium
browser, and then printed it.
 
o) File: mytest_html_cups_pdf.pdf,
was printed from Chromium using the "cups-pdf"
"printer". All lines appear in the PDF, and
can be selected, but when pasted, all resulting text
is garbage. Only one font embedded: "No name".

o) File: mytest_html_save_as_pdf.pdf,
was printed from Chromium using the "save as file"
option. All lines appear in the PDF, and can be
selected. All text (including text added by
the PDF creator) is present.

Evince shows the many embedded fonts.

(In the HTML cases, fonts used are no doubt
those already installed on my Ubuntu system.
The HTML code asked for fonts that may not
be present, and probably were substituted.)
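
The font lists that Evince shows can also be read from the
command line. The sketch below assumes the "pdffonts" tool
from poppler-utils is installed and that its output has the
usual layout: a header line, a dashed line, then one row per
font ending in the columns "emb sub uni object ID". The
function names are hypothetical. The "uni" column reports
whether a font carries a ToUnicode map; whether a missing
map is what breaks text extraction here is an assumption
that this report does not confirm.

```python
import subprocess


def parse_pdffonts(output):
    """Parse pdffonts output into a list of dicts, one per font.

    Rows are read from the right because the "type" column can
    contain spaces (for example "Type 3"), while the last five
    fields are always: emb, sub, uni, object number, generation.
    """
    fonts = []
    for line in output.splitlines()[2:]:  # skip header and dashed line
        tokens = line.split()
        if len(tokens) < 8:
            continue  # not a complete font row
        emb, sub, uni = tokens[-5:-2]
        fonts.append(
            {
                "name": tokens[0],
                "embedded": emb == "yes",
                "subset": sub == "yes",
                "to_unicode": uni == "yes",
            }
        )
    return fonts


def list_fonts(pdf_path):
    """Run pdffonts (assumed to be on PATH) and parse its output."""
    result = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return parse_pdffonts(result.stdout)
```

Comparing list_fonts() for a cups-pdf file against the
matching "save as file" output would show in one place how
the embedded fonts differ between the two.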


USING BRAVE BROWSER
-------------------

(Base file: mytest.html)

Same testcase as for Chromium browser. Viewed
in Brave browser, then printed.
 
o) File: mytest_html_brave_cups_pdf.pdf,
was printed from Brave using the "cups-pdf"
"printer". All lines appear in the PDF. However,
when everything is selected, every character is
highlighted EXCEPT the initial "T" on the first line.
When pasted, all resulting text is garbage. Only one
font embedded: "No name".

o) File: mytest_html_brave_save_as_pdf.pdf,
was printed from Brave using the "save as file"
option. All lines appear in the PDF, and can be 
selected. All text is present. (No text added
by Brave). 

Evince shows the many embedded fonts.

(Same notes may apply regarding fonts installed
on Ubuntu system)


USING WKHTMLTOPDF COMMAND
-------------------------

(Base file: mytest.html)

Same testcase as for the browsers. I'm using
this example to show that multi-font test
output can be created in different ways.

Command:

wkhtmltopdf mytest.html mytest_wk.pdf
 
o) File: mytest_wk.pdf,
All lines appear in the PDF, and fonts
are sometimes quite different than those shown in
the browsers (they may actually be more correct). 

All content can be highlighted and can be pasted
as text. Several text lines, however, contain
additional whitespace (tabs).

Evince shows the many embedded fonts.


NOTES
-----

The "creator" name embedded in the metadata for
these PDF files varies considerably, and it is
unclear to me whether the same engine is being
used by these various packages. It is clear, at
least, that cups-pdf is using Ghostscript for
PDF creation.
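
The "Creator" and "Producer" metadata can be compared across
the test files with a short script. This is a minimal sketch
that assumes the "pdfinfo" tool from poppler-utils is
installed and prints one "Key: value" pair per line; the
function names are hypothetical.

```python
import subprocess


def parse_pdfinfo(output):
    """Turn pdfinfo's "Key:   value" lines into a dict.

    Only the first colon separates key from value, so values that
    contain colons themselves (such as dates) are kept whole.
    """
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def creator_and_producer(pdf_path):
    """Return the (Creator, Producer) fields pdfinfo reports for a PDF.

    A field that is absent from the metadata is returned as None.
    """
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    info = parse_pdfinfo(result.stdout)
    return info.get("Creator"), info.get("Producer")
```

Running creator_and_producer() on each PDF in the test
package gives a side-by-side view of which tool generated
each file.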

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: cups-pdf 2.6.1-21
ProcVersionSignature: Ubuntu 4.13.0-45.50~16.04.1-generic 4.13.16
Uname: Linux 4.13.0-45-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.18
Architecture: amd64
CurrentDesktop: Unity
Date: Wed Jun 27 15:36:32 2018
InstallationDate: Installed on 2017-05-16 (406 days ago)
InstallationMedia: Ubuntu 16.04.2 LTS "Xenial Xerus" - Release amd64 
(20170215.2)
SourcePackage: cups-pdf
UpgradeStatus: No upgrade log present (probably fresh install)

** Affects: cups-pdf (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: amd64 apport-bug xenial

** Attachment added: "ZIP file with README and several testcases"
   
https://bugs.launchpad.net/bugs/1778988/+attachment/5157183/+files/testcase.zip

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1778988

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/cups-pdf/+bug/1778988/+subscriptions
